Quian Quiroga et al. [Nature 435, 1102 (2005)] have recently discovered neurons that appear to have the characteristics of
grandmother (GM) cells. Here we quantitatively assess the compatibility of their data with the GM-cell hypothesis. We show
that, contrary to the general impression, a GM-cell representation can be information-theoretically efficient, but that it must be
accompanied by cells giving a distributed coding of the input. We present a general method to deduce the sparsity distribution of
the whole neuronal population from a sample, and use it to show there are two populations of cells: a distributed-code population
of less than about 5% of the cells, and a much more sparsely responding population of putative GM cells. With an allowance for
the number of undetected silent cells, we find that the putative GM cells can code for 10^5 or more categories, sufficient for them
to be classic GM cells, or to be GM-like cells coding for memories. We quantify the strong biases against detection of GM cells,
and show consistency of our results with previous measurements that find only distributed coding. We discuss the consequences
for the architecture of neural systems and synaptic connectivity, and for the statistics of neural firing.
1. Introduction



Critical to understanding how information is processed
in the brain is the form of the neural coding that underlies
the storage and recall of memories. Is there a local, or
gnostic (Konorski, 1967), code, colloquially called a
grandmother-cell (GM-cell) representation, in which the
firing of a single neuron (or group of neurons) exclusively
codes recognition of a particular object, person or memory?
Or is the code much more distributed?

Although it is generally accepted [e.g., Churchland and
Sejnowski (1992)] that GM representations are not used in
reality, experiments (Hahnloser, Kozhevnikov and Fee, 2002;
Jung and McNaughton, 1993; Thompson and Best, 1989)
often find localist responses by individual neurons. Most
dramatically, Quian Quiroga et al. (2005) have recently
found many neurons in humans that, within the limits of
the measurements, behave like classic GM cells.

In this paper, we therefore quantitatively re-examine
the viability of GM-cell representations, with the outcome
that we refute the standard quantitative arguments
against them, both theoretical and phenomenological.
The information-theoretic argument is that GM
representations need far too many neurons for the
information coded (Rolls and Treves, 1998; Rolls, 2001;
Churchland and Sejnowski, 1992). We show that this
argument fails when one examines the information storage
capacity of the synapses rather than the representational
capacity of neurons for input stimuli. The standard
efficiency argument applies only to the input representation,
needed to represent any of the myriad possible stimuli. For
storage, a GM representation can be optimally efficient.

The phenomenological argument is that GM cells should
fire in response to a much smaller fraction of stimuli
than has been deduced from measurements of neural
responses (Quiroga et al., 2005; Abbott, Rolls and Tovee, 1996;
Waydo et al., 2006). A GM cell can be regarded as a
categorizer, and the data appear to imply that any apparent
GM cell responds to many categories of stimuli rather than
to one category.

However, our information theoretic argument shows
that associated with any GM-cell population, with its
ultra-low sparsity, is a more conventional population with
a much higher sparsity. This two-population property,
always a part of the GM-cell idea (Konorski, 1967; Page,
2000), was not allowed for in older analyses, including
that (Waydo et al., 2006) by the group responsible for the
data (Quian Quiroga et al., 2005). We devise a very
general method of analyzing neural systems with multiple
sparsities, and apply it to the data of Quian Quiroga et al.
(2005). It enables us to quantify the biases against
experimental detection of GM-like cells, most of which simply
appear as unreported silent cells, and whose estimated
numbers (Waydo et al., 2006; Henze et al., 2000; Buzsaki,
2004) may be a factor of 30 more than the reported cells.

We find that the two-population property holds, and that
less than 5% of detected cells are in the distributed-code
population: the vast majority of the cells can be GM-like.
Then we find that the number of categories coded by the
GM-like cells can be 10^5 or more. Uncertainties are minor
relative to the orders of magnitude involved. The data
of Quian Quiroga et al. (2005) therefore appear in strong
quantitative agreement with the GM-cell hypothesis.

The biases against detecting GM cells are enough to allow
consistency with previous (Abbott, Rolls and Tovee, 1996;
Waydo et al., 2006) measurements and analyses that use
a single-population model and that quantitatively argued
against GM cells. We will examine other arguments against
GM representations in the Discussion section.

2. GM systems 

Inappropriate or excessively rigid definitions can
exclude biologically interesting cases. For example,
Rolls and Treves (1998, p. 12) define a local representation
as one where "all the information that a particular
stimulus or event occurred is provided by the activity of
one of the neurons". The word "all" appears to exclude
a system where a local representation codes the result of 
testing a distributed representation of a stimulus against 
remembered items. Not all the available information is in 
the firing of the neurons in the local output representation. 

Therefore in this section we present our definitions, and 
explain important features and consequences of the defini- 
tions needed for later sections. 

2.1. Definitions 

We define a set of cells to form a local or GM-like sys- 
tem when nodes of the system can be divided into groups 
of one or more nodes, and, to a good approximation, each 
group corresponds to one particular meaningful and dis- 
tinct property of the stimulus input to the system. Each 
group we call a GM group. Typically we treat the properties 
corresponding to different GM groups as being mutually 
exclusive. We will usually identify the nodes with actual 
neurons, so that measurement in one of a GM group's cells 
of firing above a suitable threshold is strong evidence that 
the stimulus is associated with the corresponding property. 
But it is also possible that the nodes could be, for example, 
part of a dendritic tree. Then the correspondence between 
neural firing and local coding might not be direct. In any 
case we can treat the system as a categorization system. 

In contrast, a distributed representation is formed by a 
set of cells where the categorization can only be determined 
from the activity of multiple cells/nodes and where the pat- 
terns of activity overlap between distinct properties, even 
when the properties themselves are mutually exclusive. 



Classic GM cells and generalizations. The classic example
of a GM system is a facial identification system, where
the firing of a particular GM group mediates the "unitary
perception" (Konorski, 1967) of a retinal image as
corresponding to a particular individual person. A characteristic
property of a classic GM system is therefore that firing
is exclusive between different GM groups, i.e., only one
GM group is active at a time. This corresponds to the fact
that the individuals associated with the different stimuli
are themselves completely distinct. Of course, there will be
situations where the GM group firing is not completely
exclusive; for example, if a particular stimulus is ambiguous,
or if the picture of a face of one person is artificially
morphed into that of another person.

Our definition, however, was worded to allow certain 
natural generalizations from the case of classic GM cells. 
In particular, we will apply the terminology to declarative 
memories in general (episodes, facts, etc). Thus we could 
have a GM cell or GM group corresponding to each episodic 
memory. In this case it is evidently of practical importance 
for a memory to be recalled from a stimulus containing a 
few components of the original memory. Since the same 
components could be part of other memories, the pattern of 
recognition firing need not be exclusive between memories. 
Let us regard these patterns as priming the recall of the 
memories. Full conscious recall of one particular memory 
requires some extra cues and modulation. With a GM-like 
memory system, non-exclusive priming recall would be at 
a relatively low level of firing above some threshold, with 
full recall involving exclusive firing at a much higher level. 

In this case the exclusivity is not between the actual 
firing of different GM groups, but between the concepts 
corresponding to the groups. Of course, the situation is a 
little more complicated for episodic memory, since episodes 
are happenings along a continuum in time. A memory cell 
for an episode corresponds to a small range in time. The 
exclusivity between different GM groups is between well- 
separated episodes, of which there are evidently a very large 
number. 

Another natural generalization of the GM-cell concept
is to local coding for output, with strong experimental
evidence in the work of Hahnloser, Kozhevnikov and Fee
(2002).



High-level GM systems v. low-level local coding. We
choose to restrict our use of the "GM-cell" terminology to
the higher levels of neural processing. For lower levels, we
will use the broader term "local coding"; by this we mean
coding by a set of cells each of which is responsive to some
patch of stimulus space (with fall-off at the edges,
naturally). When the relevant properties of the stimulus are in
a low-dimensional space — as for color or position in an 
environment — the collection of patches can give complete 
coverage. But when the relevant stimulus space is high di- 
mensional, one can only expect local coverage of a minute 
fraction of possible stimuli — for example, to correspond 
to linguistic phonemes out of all possible auditory stimuli. 
Fig. 1. Cell G is a classic GM cell that fires in response to a certain
kind of visual pattern (e.g., the face of a particular person). It 
receives processed input from the top of a hierarchy that processes 
raw sensory input. The GM cell passes recognition information to a 
set of output cells, which could, for example, recall the name of a 
recognized person. 

There is considerable evidence that can be interpreted 
as supporting some kind of local coding below the highest 
perceptual levels. The key question for us is whether local 
coding is also used at the highest levels. 

Internal stimuli. Input to a memory system need not
correspond to actual physical events external to the brain. 
Some input can be completely generated internally, as 
with an author planning a novel. Memories of people and 
episodes in the novel have the same neural status as mem- 
ories of real people and events. 

2.2. Properties 

One big computational advantage of a GM-like system
is the simplicity of information flow, Fig. 1, so
that the representation is easy to construct and
manipulate; see, e.g., Gardner-Medwin and Barlow (2001);
Baum, Moody and Wilczek (1988). New memories are
formed with new or unallocated neurons, so that they
interfere minimally with old memories (Quartz and Sejnowski,
1997). The input to GM cells is from a distributed
representation of the stimuli. Output to downstream neurons
using the categorization can be simply taken from a single
GM cell.

Above-threshold firing of a GM cell models a property of
the cause of the stimulus. Thus when firing is exclusive be- 
tween different GM groups, this corresponds to exclusivity 
between the modeled properties of the associated external 
stimuli, like the identity of a person. 

It is therefore tempting to use exclusivity of firing be- 
tween different GM groups as the primary measurable cri- 
terion for characterizing GM-like systems. But natural gen- 
eralizations to declarative memory motivate us to relax this 
criterion. For example, with episodic memories, a stimulus 
may cause a response in nodes for those episodes having 
important commonalities with the stimulus. 

In addition, there can be multiple categorization systems, 
and there is no requirement of exclusivity between different 
systems. This is particularly clear at low levels in the pro- 
cessing hierarchy, where we could have separate local rep- 
resentations, for example, of the color and shape of an ob- 
ject. A collection of local representations of such relatively
low-level features then forms a distributed representation
of the whole object, suitable for input to the next level of 
hierarchical processing. 

Note that we prefer to use the terminology "GM-like 
system" rather than "GM representation" to emphasize 
two aspects: The first is that if the GM-cell idea applies 
in its classic sense of recognition of individual people, it 
is likely to apply much more generally to all declarative 
memories. 

The second is that we wish to treat the firing of a GM 
cell as coding the result of a recognition computation from 
a stimulus. But the use of the word "representation" for a 
GM-like system would carry the connotation of represent- 
ing the stimulus, which is not generally appropriate. If noth- 
ing else, local coding typically dramatically fails to cover 
the stimulus space. For example, consider a distributed
input representation on 100 binary neurons, a small number
compared with real sensory systems. There are 2^100 ~ 10^30
distinct stimuli, a number many orders of magnitude larger
than the total number of neurons in any brain. A practical local
representation can only apply to a minute fraction of stimuli,
presumably ones that are especially salient. A local repre- 
sentation can only provide full coverage for a stimulus space 
of very low dimension, like that for color. 
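
The coverage argument can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes about 10^11 neurons for a human brain (the order of magnitude quoted in Section 3.2) and supposes, generously, that every one of them is a local-code cell for a different stimulus.

```python
# Coverage of a local code over a 100-neuron binary input space.
n_input_neurons = 100               # binary input neurons, as in the example
n_stimuli = 2 ** n_input_neurons    # number of distinct binary input patterns
n_brain_neurons = 10 ** 11          # assumed order of magnitude for a human brain

# Upper bound on the fraction of stimuli that could each have a dedicated cell.
max_coverage = n_brain_neurons / n_stimuli

print(f"distinct stimuli: {n_stimuli:.3e}")
print(f"maximum coverage: {max_coverage:.1e}")
```

Even under this generous assumption the covered fraction is below 10^-18, which is the sense in which local coding "dramatically fails to cover the stimulus space".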

2.3. Detection of GM-like systems 

Practical experiments only involve a limited number of 
stimuli and cells, and definitely do not give the detailed 
synaptic information that determines all possible causes of 
a cell's firing. Thus it is non-trivial to distinguish a GM 
system from a distributed representation, when the dis- 
tributed representation is sparse, and when we allow natu- 
ral generalizations of the GM-cell idea. 

We illustrate the issues by comparing two very different
computational memory models, one by Hopfield (2006),
and one by Baum, Moody and Wilczek (1988) (BMW).



Hopfield's model. In Hopfield's particular example, the
input stimulus concerns properties of people, and the input 
representation is carried by a set of 1000 binary neurons. 
These are divided into 50 sets of 20 neurons. The 20 neurons 
in each set give a local representation of 20 possible values 
of a property of the stimulus. For example, to input the 
name of a person, one of a set of 20 name-coding neurons 
would be active. Binary synapses connect every neuron to 
every other neuron. 

Each stored memory is considered as the set of 50 prop- 
erty values for a particular person, and is coded in the 
state of the synapses. The synaptic strengths are set by a 
Hebbian-like rule on presentation of stimuli. 

Recall of a memory is caused by a stimulus that consists 
of partial data about an individual, i.e., values for a subset 
of the 50 properties. Memory retrieval results in completion 
of a partial stimulus to the full set of properties for the 
corresponding individual. There is no corresponding GM 
cell; the model is a fully distributed memory system. 
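
The input code described above (50 properties, each with 20 locally coded values) can be sketched as follows. This shows only the stimulus encoding and a partial cue, not Hopfield's network dynamics or learning rule; the function name `encode` and the random choice of property values are assumptions made for illustration.

```python
import random

N_PROPERTIES = 50                   # properties per person (from the text)
N_VALUES = 20                       # possible values per property (from the text)
N_INPUT = N_PROPERTIES * N_VALUES   # 1000 binary input neurons

def encode(values):
    """Return a 1000-element 0/1 list.

    `values` maps property index -> value index. Properties that are
    absent from `values` are left blank (all 20 of their neurons off).
    """
    pattern = [0] * N_INPUT
    for prop, val in values.items():
        pattern[prop * N_VALUES + val] = 1
    return pattern

rng = random.Random(0)  # fixed seed, for reproducibility only
person = {p: rng.randrange(N_VALUES) for p in range(N_PROPERTIES)}

full = encode(person)                                  # all 50 properties known
partial = encode({p: person[p] for p in range(10)})    # only 10 properties known

print(sum(full), sum(partial), sum(full) / N_INPUT)    # 50 10 0.05
```

A full pattern therefore has 50 of 1000 neurons active, which is the 5% sparsity used in the discussion of the BMW model.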
Fig. 2. Architecture of BMW model in its most basic form.

BMW model. The basic BMW model is the simplest possible
form of a GM system: a feed-forward perceptron with
one intermediate memory-cell layer, Fig. 2. The memory 
cells are arranged to respond in a GM-cell style to recog- 
nize inputs that correspond to stored memories. The model 
gains its power by applying it to the situation that the in- 
put forms a sparse binary representation of stimuli. This 
would typically be obtained from the top of a processing 
hierarchy, as in Fig. 1. Improvements in the model can be 
made, for example, by adding inhibitory interneurons to 
enforce a winner-take-all action in the memory-cell layer, 
but the simplest form of the model is sufficiently robust for 
our illustration. 

To solve the same pattern-completion task as Hopfield's
model, we make the output cells identical to the input cells, 
thereby specializing from a heteroassociative memory sys- 
tem to a homoassociative system. 

Memories are created by presenting a full stimulus to 
the system, and arranging for some unallocated memory 
neuron a to match its synaptic strengths to the stimulus. 
The chosen neuron changes to a state of being allocated, 
with correspondingly greatly reduced synaptic plasticity. 
Let $x^{(a)} = (x^{(a)}_1, \dots, x^{(a)}_{1000})$ denote the prototype pattern
for memory neuron a. Then the strengths of the neuron's
input synapses are set to a constant times the input pattern:
$w^{(a)}_j = W x^{(a)}_j$, and similarly for the output. The constant
W can be scaled out of all of our formulae, but its use assists
in relating the formulae to properties of real neurons. When
another stimulus x' is presented, the response of neuron a
is obtained from a thresholded sum over its inputs:

$$ y_a = \Theta\Big( \sum_j w^{(a)}_j x'_j - T \Big), \qquad (1) $$

where $\Theta$ denotes the thresholding function and T the threshold.



This formulation is for artificial analog neurons, but it is 
readily generalized to realistic spiking neurons, and the 
model is robust enough that minor variations do not affect 
the principles governing its performance. 

We now review how standard properties (Kanerva, 1988)
of sparse binary codes show that the system implements
the pattern completion task if the threshold is set suitably.
Let a stimulus x' be presented; it may be some full pattern,
in which 5% of the input neurons are active, or it may be a
partial pattern. The input to neuron a is $\sum_j w^{(a)}_j x'_j$, which
is W times the overlap between x' and the prototype pattern
for memory a, i.e., it measures the number of common
on-bits.

If the new stimulus is unrelated to memory a, then its 
on-bits are random relative to those for memory a. Thus for 
a full pattern the typical overlap is 5% of the total number 
of on-bits in a full pattern, i.e., 5% × 50 = 2.5, and less for
a partial pattern. 
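
The estimate of 2.5 common on-bits can be checked by simulation. The sketch below assumes the 50-property, 20-value input code of the example above and treats "unrelated" as meaning that each property value is drawn independently and uniformly; both the sampling scheme and the number of trials are illustrative choices.

```python
import random

N_PROPERTIES, N_VALUES = 50, 20   # sizes taken from the text
rng = random.Random(1)            # fixed seed, for reproducibility only

def random_person():
    """One value index per property, drawn uniformly at random."""
    return [rng.randrange(N_VALUES) for _ in range(N_PROPERTIES)]

def overlap(a, b):
    """Number of common on-bits = number of properties with the same value."""
    return sum(1 for u, v in zip(a, b) if u == v)

prototype = random_person()
trials = 20000
overlaps = [overlap(prototype, random_person()) for _ in range(trials)]

mean_overlap = sum(overlaps) / trials
max_overlap = max(overlaps)
print(mean_overlap, max_overlap)
```

The mean should come out close to 50/20 = 2.5, and the largest overlap seen should remain far below the 50 on-bits of the prototype itself.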

We therefore set the threshold intermediate between the 
input for the full prototype pattern, i.e., 50W, and the
typical input for a full unrelated pattern, i.e., 2.5W. Then 
unrelated patterns almost never cause neuron a to fire. 
Thus with sparse input patterns we have the well-known 
automatic orthogonalization against unrelated stimuli. 

But now consider an input corresponding to a subset of 
the features in memory a's prototype pattern. For example, 
a quarter of the features could have been detected. In that 
case, the input to neuron a would be (1/4) × 50W = 12.5W.

Hence with an appropriate threshold setting, we get firing 
of neuron a in response to a subset of the features in its 
prototype pattern. This then causes firing of the whole set 
of output neurons that correspond to the prototype, i.e., 
we have pattern completion. 

Because the input to a memory cell is much bigger for 
patterns related to its prototype than for unrelated pat- 
terns, the operation of the system is robust against changes 
in the exact equations describing its operation. 
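
A minimal sketch of the basic model, under stated assumptions, is given below. It is not the implementation of Baum, Moody and Wilczek: the threshold of 10W, the use of 100 stored memories, the cue of 13 properties (just over a quarter), and the helper names `encode`, `store` and `recall` are all choices made here for illustration.

```python
import random

N_PROPERTIES, N_VALUES = 50, 20
N_INPUT = N_PROPERTIES * N_VALUES
W = 1.0                  # synaptic scale; it drops out of the comparison
THRESHOLD = 10.0 * W     # assumed value, between 2.5 W and 12.5 W

rng = random.Random(2)   # fixed seed, for reproducibility only

def encode(values):
    """Map {property index: value index} to a 1000-element 0/1 pattern."""
    pattern = [0] * N_INPUT
    for prop, val in values.items():
        pattern[prop * N_VALUES + val] = 1
    return pattern

def store(memory_cells, pattern):
    """Allocate a new memory cell whose weights are W times the pattern."""
    memory_cells.append([W * x for x in pattern])

def recall(memory_cells, stimulus):
    """Return (indices of cells above threshold, completed output pattern)."""
    fired, output = [], [0] * N_INPUT
    for index, weights in enumerate(memory_cells):
        drive = sum(w * x for w, x in zip(weights, stimulus))
        if drive > THRESHOLD:
            fired.append(index)
            # Output weights copy the input weights (homoassociative case).
            output = [max(o, 1 if w > 0 else 0) for o, w in zip(output, weights)]
    return fired, output

people = [{p: rng.randrange(N_VALUES) for p in range(N_PROPERTIES)}
          for _ in range(100)]
cells = []
for person in people:
    store(cells, encode(person))

# Cue with 13 of the 50 properties of person 0.
cue_props = rng.sample(range(N_PROPERTIES), 13)
cue = encode({p: people[0][p] for p in cue_props})
fired, completed = recall(cells, cue)
print(fired, completed == encode(people[0]))
```

With these settings only the cell allocated to person 0 is driven above threshold (its input is 13W), and the output reproduces that person's full 50-property pattern.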

Measurements. To mimic a biological experiment, one
could measure the responses of a sample of model neurons 
to a sample of stimuli. In Hopfield's model if there are fewer 
than about 10 or 20 stimuli, the firing of many of the in- 
dividual neurons would correspond to single people. Thus 
the distributed nature of the memory representation would 
not be immediately apparent. 

But it is not necessary to enlarge the stimulus set to 
test this. Data from a sample of neurons and stimuli give 
the fraction of stimuli to which neurons respond. Then by 
using knowledge of the capacity of the system, which in 
this case is several hundred memories, one can statistically 
extrapolate from the sample to show that each neuron is 
responsive to multiple unrelated stimuli. 
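
A rough version of this extrapolation is sketched below. It assumes that each input neuron responds independently to a fraction p = 1/20 of people, and uses 300 as a stand-in for "several hundred" stored memories; both are simplifying assumptions rather than part of Hopfield's model.

```python
from math import comb

p = 0.05   # fraction of people driving a given neuron (1 of 20 values)

def prob_k_responses(n_stimuli, k):
    """Binomial probability that a neuron responds to exactly k of n stimuli."""
    return comb(n_stimuli, k) * p**k * (1 - p)**(n_stimuli - k)

results = {}
for n in (10, 20, 300):   # 300 is an assumed stand-in for "several hundred"
    p_one = prob_k_responses(n, 1)
    p_any = 1 - prob_k_responses(n, 0)
    # Among responsive neurons, the fraction responding to a single person,
    # and the expected number of people driving each neuron.
    results[n] = (p_one / p_any, n * p)
    print(n, round(results[n][0], 3), results[n][1])
```

With 10 stimuli, most responsive neurons respond to only one person and so look GM-like, whereas extrapolating to a few hundred memories gives around 15 people per neuron.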

As for the BMW model, the input/output cells would 
have the same kind of response characteristics as all the cells
of the Hopfield model, responding to 5% of the stimuli. But 
there are also GM memory cells, which respond much more 
rarely. If the sample data include at least a few responsive 
GM neurons, one can detect the different properties of the 
input and GM cells, and thereby distinguish the system 
from the Hopfield model. 

In this paper, we will construct a method to extrapolate 
in general from data with a sample of cells and stimuli to 
the whole system. The method enables us to compute which 
of the following properties of a set of putative GM cells is 
consistent with data: (a) the cells actually could code for
single properties; (b) each cell is expected to respond to
multiple unrelated properties, after an extrapolation to the
full set of possible stimuli.

One confounding issue is that if one detects a response 
from a GM-like cell it is easy to misidentify the property 
corresponding to the cell's firing. If a cell only responds, 
within limited data, to a particular individual person, the 
cell could indeed be a classic GM cell corresponding to that 
person. But it could also be, for example, a GM cell for an 
episodic memory that contains the person. In that case it 
would respond to stimuli containing other components of 
the episode. 

The distinguishing feature of the most general kind of 
GM-like cell, but one that can be hard to test, is that ap- 
parently different stimuli that cause it to respond are in 
fact related. In contrast, with a purely distributed repre- 
sentation at the highest level, there is no high-level relation 
between, for example, the multiple people causing a par- 
ticular cell to respond. The only relation is an identity of 
lower-level features in the input representation. The local- 
ity is at the feature level, not at the high level. 



3. Information requirements for pattern
recognition

We now quantify the information requirements for a
memory or categorization system. First we distinguish the
activity state and the storage state. The activity state is
the pattern of firing of the neurons, and the storage state
(Churchland and Sejnowski, 1992, p. 142) is the pattern of
synaptic connectivity and strengths.

Furthermore, within the activity state of a memory system,
we distinguish an input representation and a recognition
representation. The input representation is of the current
input stimulus, like a visual scene, while the recognition
representation concerns which stored pattern (if any)
corresponds to the current input.

3.1. Input representation

The immediate input to a memory system must be able
to represent the relevant features of any possible stimulus,
and not just those previously encountered stimuli for which
there is a memory trace. Here the standard arguments for
distributed representations apply unambiguously, and a
GM representation is not possible. The argument is simply
that N neurons code for at most N exclusive properties in
a GM system, but that they code for exponentially more
with a distributed representation, for example 2^N with a
simple binary code.

In general the immediate input to a memory or
categorization system is not the raw sensory input but a highly
processed representation of those high-level features that
are relevant for the system's particular task. For example,
input for face recognition could involve neurons coding for
the presence, shape and position of eyes, nose and mouth,
etc; individual features could well be coded locally, but the
collection of feature representations would always form a
distributed representation. Each neuron has a range of
distinguishable firing rates, so that the raw information
capacity in the activity of N neurons is a few times N bits.
But robustness requires a certain amount of redundancy,
and firing is often sparse, both of which reduce the
information per neuron. Measurements (Abbott, Rolls and Tovee,
1996) show that in hippocampal neurons about 0.3 bits of
independent information are coded per neuron and suggest
that a few hundred neurons suffice for coding possible faces.
For example, 300 neurons code about 100 bits, sufficient for
about 2^100 ~ 10^30 different categories of face, an entirely
satisfactory number.

Note that once an input representation is sufficiently
small, pure representation efficiency is not a dominant
consideration. Issues of processing speed, metabolic efficiency,
and algorithmic robustness can be more important. For
example, sparse distributed representations appear to be
favored (Rolls and Treves, 1998), since they can give weak
interference between memories during synaptic plasticity.



3.2. Storage in synapses 

The information capacity in the storage state, i.e., in the
synapses, was estimated by Chklovskii, Mel and Svoboda
(2004) and by Stepanyants, Hof and Chklovskii (2002).
There are two contributions, from the synaptic topology 
and from the synaptic strengths, giving a total of 5 to 10 
bits of raw storage capacity per synapse. We divide by 10, 
to provide a plausible allowance for redundancy. Thus we 
need about one synapse for each bit of storage information. 

This calculation uses only very basic physical informa- 
tion about synapses and neural processing, so it is certainly 
accurate at the order of magnitude level. Since measures of
information in units of bits are independent of the physical
implementation, the numbers can be directly compared with
those for ordinary digital computers, and experience with
data processing and storage can be used to derive minimum
numbers of synapses for a task. A human brain has around
N_tot ~ 10^11 neurons and C ~ 10^4 synapses per neuron.
So its synapses store about C N_tot ~ 10^15 bits, i.e., about
10^5 GByte, up to a factor of 10 or so.
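
The conversion to computer units is simple arithmetic, sketched below. It assumes C of about 10^4 synapses per neuron (the figure also used in Section 3.3), one bit per synapse after the redundancy allowance, and decimal gigabytes of 10^9 bytes.

```python
N_TOT = 1e11             # neurons in a human brain (order of magnitude)
C = 1e4                  # synapses per neuron (order of magnitude)
BITS_PER_SYNAPSE = 1.0   # after the redundancy allowance described above

total_bits = N_TOT * C * BITS_PER_SYNAPSE
total_gbyte = total_bits / 8 / 1e9   # 8 bits per byte, 10^9 bytes per GByte

print(f"{total_bits:.1e} bits, about {total_gbyte:.2e} GByte")
```

The result, about 1.25 × 10^5 GByte, agrees with the quoted 10^5 GByte at the order-of-magnitude level that the argument requires.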

3.3. Recognition representation and total stored 
information 

Suppose the system has a repertoire of R stored memo- 
ries. Each is an arbitrary association of a category to stim- 
ulus features. So we attribute to each memory A bits of 
association information, which we term its semantics. This 
includes both input and output information. (E.g., facial 
structure and name for a person, and, in fact, all remem- 
bered information about the person.) It is important to in- 
clude output semantics in the associations, since they are 
what allow the retrieval of a particular memory to cause 
memory-specific behavior. These are quite arbitrary
associations, for example of a linguistic name to a specific
person, and without such output associations there is no
purpose to storing a memory. The output associations do need
to be stored, and they therefore contribute to the calcula- 
tion of the minimum size of the system just like the input 
associations. 

The total association information is RA bits, so that
approximately this number of synapses is needed. For example,
with a repertoire of R = 5000 and an average of A =
10000 bits of information per item, we need about 5 × 10^7
synapses. The measured number of synapses per neuron is
around 10^4, thereby implicating about RA/10^4 neurons,
i.e., a number proportional to the size of the system's
repertoire (and the information per memory trace).

Hence if the system uses R GM cells for a recognition
representation, this is at most a constant factor beyond the
neurons needed to carry the storage synapses. Moreover, if
A is larger than about 10^4, there is not even any overhead
at all in using GM cells. We can characterize this by saying
that if the memories are semantically rich, then a GM
strategy for the recognition output can be optimally efficient,
as in the model of Baum, Moody and Wilczek (1988).
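
The counting behind this statement can be written out explicitly. The sketch below uses the example values from the text and assumes one GM cell per memory (GM groups of size one); the variable names are chosen here for illustration.

```python
R = 5000      # stored memories (example value from the text)
A = 10_000    # bits of association information per memory (example value)
C = 10_000    # synapses per neuron, about 10^4

synapses_needed = R * A                 # about one synapse per stored bit
storage_neurons = synapses_needed / C   # neurons needed to carry those synapses
gm_cells = R                            # one GM cell per memory (assumption)

# Ratio of GM cells to storage neurons; algebraically this equals C / A.
ratio = gm_cells / storage_neurons

print(synapses_needed, storage_neurons, ratio)   # 50000000 5000.0 1.0
```

Because the ratio is C/A, the R GM cells are no more numerous than the neurons already required for storage whenever A is at least about 10^4 bits, and exceed them only by the constant factor C/A for lighter memories.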

In contrast, if one ignored the storage requirement, one 
would assert that N recognition neurons can code exponentially
more recognized categories, e.g., 2^N.

It is important that in quantifying the information the 
term "bit" is used in the strict information theoretic sense. 
This means that if each memory were coded as an actual bit 
pattern, each of the 2^A possible patterns would be equiprobable.
Thus, on a computer, the bit count refers to an optimally
compressed representation. However, only certain
features of the input are relevant for tasks like face identi- 
fication: a minimalist line drawing often suffices for unam- 
biguous identification. So in computer terms the association 
bits are with respect to a representation that is both lossy 
and compressed. In the opposite direction, it can be diffi- 
cult to perform computations on optimally compressed rep- 
resentations, and it is also difficult to measure accurately 
the probabilities of occurrences of different kinds of stimuli, 
since the number of possible stimuli far exceeds the num- 
ber actually experienced. Moreover redundancy (in the in- 
formation theoretic sense of using more than the minimum 
necessary number of bits) is useful in giving robustness to a 
system. For all these reasons, we must expect the physical 
capacity of a system needed to code memories to be a substantial
factor larger than a minimal physical implementation
of RA bits. Nevertheless this measure is important in
quantifying information in an implementation-independent 
way. It also enables us to estimate the information stor- 
age requirements by examining implementations of related 
tasks on a digital computer. 

Well-known examples of simple line drawings and pic- 
tures of artificially low resolution show that the informa- 
tion to identify faces could be quite modest, if a suitable 
representation is used. But the synaptic size of the system
also depends on the remaining association information,
treated as output. This can be much larger in size,
effectively amounting to a biography of the individual
concerned in each memory.



3.4. Objections and answers 

A number of objections to our bounds and ideas for evad- 
ing them have been proposed, which we now answer. The 
general answer is simply that the information theoretic 
bound represents an absolute physical limit that it is im- 
possible to exceed. All that is required is that we count the 
bits of information in the strictly correct information theo- 
retic sense, and that we have identified the correct physical 
location of memory in the synapses. 

Counting memories. Does not the idea of quantifying
memories as discrete items that can be counted carry
the implication that we use a GM system? Are there not
difficulties in counting memories in distributed memory
systems? In fact the classical kinds of distributed memory,
e.g., Hopfield (2006); Treves and Rolls (1991, 1992);
Amit, Brunel and Tsodyks (1994), are regarded as storing
patterns, which can be counted. What we do have in mind
is declarative, or explicit memory, the kind considered as 
prototypically hippocampal. Here it is reasonably clear 
what is meant by a single memory: a picture, a scene, 
or the meaning of a word; all of these are discrete. Even 
with episodic memory, where there is a continuous time 
variable, we can observe that there is a correlation time 
within a continuous series of events. As regards storage 
requirements, we simply identify a single memory with the 
happenings within a correlation time, which is evidently 
of the order of seconds or minutes. Of course, only a small 
fraction of these are stored in long term memory. We will 
not require great precision here. 

In the contrasting case of implicit procedural memory, 
it is much less obvious what should be defined as a single 
memory item. But we are not concerned with this case. 



Multiplexing Could not one gain by allowing a neuron 
to respond to multiple different stimuli? Could not a sin- 
gle face-identification cell respond to either George Bush, 
Jennifer Aniston or John Hopfield, for example? This mul- 
tiplexing is just going in the direction of a distributed rep- 
resentation. Our argument so far does not rule that out; all 
it says is that this does not provide a way of reducing the 
synapse count. It is our statistical argument in later sec- 
tions that enables us to estimate the degree of multiplexing. 

Now if a neuron responds to multiple categories, then 
there is interference at the neural level between different 
memory traces. An unambiguous categorization then re- 
quires the use of the firing information from more than one 
neuron. In the case of lightweight memories, i.e., with sub- 
stantially less than 10000 bits of associations per memory, 
multiplexing of memories can indeed reduce the neuron 
count, and is allowed by our general argument. But with 
richer memories, there is no gain. 

Representation v. memory The memories we have discussed
are typified by hippocampal memories. Consider instead
visual area V1; the number P of distinguishable activation
patterns is exponential in the number N of neurons,
e.g., $P = 2^N$. An outside observer could identify different
faces from the different activation patterns. Why should 
we not regard this as memory, if we are to regard observa- 
tion of (much simpler) activation patterns in GM cells as 
identifying faces, and as in fact part of a memory system? 

The difference is not in the outside observations, but in 
the use the organism itself makes of the information. A 
memory is not useful unless it produces some consequence 
when recalled. What we mean, for example, is that seeing 
John Hopfield on the other side of a street might induce us 
to cross the street and greet him by name. For the billions
of other possible people, there would either be a different
response or no response at all.

This is why we defined the associations for a memory to 
include output as well as input. There must be sufficient 
information stored to enable the relevant computation to 
be done from the activation pattern corresponding to the 
stimulus. 

Without being concerned with storage, it is perfectly 
possible to compare one activation pattern with a previous 
one, to identify whether or not it has changed. But to com- 
pare it with patterns of activation for all the people one 
remembers, and to take appropriate actions, one needs ap- 
propriate storage. It is simply not possible to evade the fun- 
damental physical necessities given by information theory. 

Coding commonalities Many remembered faces have fea- 
tures in common. Cannot this be used to reduce the num- 
ber of synapses and neurons needed, by coding the common 
features in a special subsystem for the common features? 
Suppose we had a set of faces characterized by very short, 
dark, and curly hair. Would there not be a gain by allo- 
cating a neuron to this combination of characteristics and 
using it instead of separate neurons for hair length, color 
and curliness? 

If the different hair features were equiprobable and uncorrelated,
the general argument would prevent any gain.
As an example, suppose each of the three hair characteristics
has 8 equiprobable values, for $8^3 = 512$ distinguishable
combinations. If all the combinations were equiprobable,
we would allocate 9 bits for the information content. Probability
here refers to a prior probability of occurrence, i.e.,
before the creation of a memory trace.

But if instead all the characteristics were perfectly cor- 
related, then there would only be 8 combinations, which 
could be represented in 3 bits. When we construct memo- 
ries, this gives a gain of a factor of 3 in storage if we only 
represent the combinations that actually occur in the input 
representation rather than all possible combinations. 
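
The arithmetic of this example can be checked directly. The following Python sketch counts the bits needed for a choice among equiprobable combinations in the uncorrelated and in the perfectly correlated case. The function name `bits_needed` and the variable names are ours and purely illustrative.

```python
import math

def bits_needed(num_equiprobable_states: int) -> float:
    """Information content, in bits, of one choice among equiprobable states."""
    return math.log2(num_equiprobable_states)

values_per_characteristic = 8   # from the text: 8 equiprobable values each
num_characteristics = 3         # hair length, color and curliness

# Uncorrelated characteristics: every combination can occur.
uncorrelated_bits = bits_needed(values_per_characteristic ** num_characteristics)

# Perfectly correlated characteristics: one value fixes the other two.
correlated_bits = bits_needed(values_per_characteristic)

storage_gain = uncorrelated_bits / correlated_bits
print(uncorrelated_bits, correlated_bits, storage_gain)
```

The ratio of the two bit counts is the factor by which storage is reduced when only the combinations that actually occur are represented.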

But this is exactly what is meant by using a correct 
measure of information to compute the minimum number 
of synapses. Note that the task carried out by a memory
system is not merely to identify the best fit to a current
stimulus among stored memories; for that very few bits are
needed. For example, if a stimulus is represented by 100
bits, but only 8 memories are stored, then only $\log_2 8 = 3$
suitably coded bits are needed to identify the stimulus, if
it is assumed that the stimulus corresponds to one of the 
memories. But it is also necessary to identify the case that 
the stimulus fails to correspond to a stored memory, so that 
it is a candidate for a new memory. It is for this that the 
other 97 bits are needed. 



3.5. Comparison with models 

Although our argument was used to show that GM 
coding can be optimally efficient in the use of synapses 
and neurons, the bounds on synapse and neuron num- 
ber are independent of the coding method. It is therefore 
useful to verify that the bounds are obeyed by the polar 
opposite of GM systems, i.e., by conventional distributed 
memory systems. Calculations of the capacity of such systems
have been made, e.g., Treves and Rolls (1991, 1992);
Amit, Brunel and Tsodyks (1994), and it can be checked
that they do obey our bounds. But what appears to have
been missed is that the capacity limit also removes the
argument against using GM cells.

Hopfield's recent model (Hopfield, 2006) provides an excellent
example. In this model, each stored memory corresponds
to the values of each of 50 categories, with 20 possible
values per category. This gives a total of $50 \log_2 20 \approx 216$
bits of information. Retrieval of a memory results in
a neural representation of this information in the firing of 
50 out of 1000 binary neurons, for a biologically realistic 
sparsity of 5%. 

Storage is in binary synapses with all-to-all connectivity
on the 1000 neurons, for a total of slightly under $10^6$
bits of storage capacity. This implies that the system can
store at most $10^6/216 \approx 4500$ separate memories. In fact,
the simulations in Hopfield (2006) show that, with the
algorithms used in that paper, performance noticeably
degrades when about 250 memories are stored, i.e., a factor
of about 20 below the physical limit. Thus the information-theoretic
bounds are obeyed. The problem is that the memories are stored on
the synapses connecting the active neurons in a particular
memory. Synapses overlap between memory traces, which
causes interference if too many memories are stored.
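
The numbers in this comparison can be reproduced with a few lines of Python. This is only a sketch of the counting argument: we assume one bit per directed synapse and no self-connections, which is our reading of "slightly under $10^6$" and is not stated explicitly in the text.

```python
import math

categories = 50            # categories per memory (from the text)
values_per_category = 20   # possible values per category (from the text)
neurons = 1000             # binary neurons with all-to-all connectivity

# Information content of one stored memory.
bits_per_memory = categories * math.log2(values_per_category)

# Assumption: one binary synapse per ordered pair of distinct neurons.
synapse_bits = neurons * (neurons - 1)

# Information-theoretic upper bound on the number of stored memories.
max_memories = synapse_bits / bits_per_memory

observed_limit = 250       # memories at which performance degrades (from the text)
shortfall = max_memories / observed_limit

print(f"{bits_per_memory:.0f} bits per memory")
print(f"at most about {max_memories:.0f} memories")
print(f"observed limit is a factor of about {shortfall:.0f} below the bound")
```

With these assumptions the bound comes out within a few percent of the rounded figure of 4500 quoted above, and the shortfall is close to the quoted factor of 20.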

Consider in contrast the BMW model (Baum, Moody and Wilczek,
1988), which, as observed by its authors, is optimal in its
number of synapses. Suppose for input and output we use
the same 1000 neurons as in the distributed model. Then to
store $N$ memories we add $N$ memory neurons and $2000N$
synapses (for input and output). We also may use relatively
few extra interneurons and synapses to implement
winner-take-all dynamics.

The number of synapses is double our minimum estimate, 
because we treat input and output semantics separately: 
they could be different. This is much more efficient than the 
distributed model. There is also an extra neuron for each
memory. But we can readily increase the capacity for stored
memories by increasing the number of memory neurons.
With $N = 500$, we would have the same number of synapses
as in Hopfield's model, 50% more neurons, but double the 
capacity. 
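
As a check on this comparison, the following Python sketch counts neurons and synapses for the BMW arrangement described above and sets them against the figures quoted for Hopfield's model. The function names are ours, and the extra interneurons for winner-take-all dynamics are ignored as a simplifying assumption.

```python
IO_NEURONS = 1000            # shared input/output neurons (from the text)
HOPFIELD_NEURONS = 1000
HOPFIELD_SYNAPSES = 10**6    # "slightly under 10^6" in the text, rounded here
HOPFIELD_CAPACITY = 250      # memories before performance degrades

def bmw_synapses(num_memories: int) -> int:
    """Each memory neuron has one synapse per input and per output neuron."""
    return 2 * IO_NEURONS * num_memories

def bmw_neurons(num_memories: int) -> int:
    """Input/output neurons plus one memory neuron per stored memory."""
    return IO_NEURONS + num_memories

n = 500
synapse_ratio = bmw_synapses(n) / HOPFIELD_SYNAPSES
neuron_ratio = bmw_neurons(n) / HOPFIELD_NEURONS
capacity_ratio = n / HOPFIELD_CAPACITY
print(synapse_ratio, neuron_ratio, capacity_ratio)
```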

With the distributed model, the capacity can be in- 
creased only either by simply duplicating the system, 
which is always a possibility, or by a change in architecture. 
An appropriate change in architecture would be to use the 
original input neurons to feed a separate layer which codes 
the same information more sparsely, after the style of a 
support vector machine. 

Note that the BMW model reliably performs the same 
pattern completion task as Hopfield's model. That is, a 
stimulus consisting of a few values is completed to the val- 
ues of all categories for the corresponding memory. The 
BMW model essentially performs a comparison of the ac- 
tive bits in the sparse input pattern with the on-bits in the 
stored pattern. Because of the 5% sparseness of the input, 
the probability that an on-bit in the stimulus coincides with 
an on-bit in an unrelated stored pattern is also 5%. Once 
more than a few bits are examined, the probability of a 
chance coincidence is extremely small, with a correspond- 
ingly small misidentification probability. 
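
To illustrate how quickly the chance-coincidence probability falls, the sketch below computes the probability that every one of k examined on-bits of a stimulus also happens to be on in an unrelated stored pattern. It assumes that the bits of unrelated patterns are statistically independent, which is our simplification; the function name is hypothetical.

```python
SPARSITY = 0.05   # fraction of on-bits in a stored pattern (from the text)

def chance_match_probability(num_bits_examined: int, sparsity: float = SPARSITY) -> float:
    """Probability that all examined on-bits coincide with on-bits of an
    unrelated pattern, assuming independent bits."""
    return sparsity ** num_bits_examined

for k in (1, 2, 4, 8):
    print(k, chance_match_probability(k))
```

Under this assumption the probability drops by a factor of 20 for each additional bit examined.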

4. Statistics of neural responses 

4.1. Two-population property of GM systems 

We have seen that the efficiency argument against GM 
cells disappears, especially for semantically rich objects, 
like most people's grandmothers. But the efficiency anal- 
ysis for the input representation shows that the GM cell 
population is necessarily accompanied by a population of 
cells carrying a distributed code. 

Experimental characterizations of the different kinds of 
coding can be made by measuring the sparsity of neural 
responses to stimuli. 

By sparsity we mean, for each cell, the fraction of
stimuli to which it responds. We assume that, as in
Quian Quiroga et al. (2005), some threshold criterion is
defined cell-by-cell as to whether a cell responds or not.
Thus the neuron is treated as binary. Other definitions
involving analog firing rates are possible, but we will not use
them. Note that with our definition and when all cells have
the same sparsity, the population and lifetime sparsities
are equal, unlike the case (Willmore and Tolhurst, 2001)
with other definitions of sparsity.

For a system with fully distributed coding, we expect 
to measure sparsities characteristic of the input and out- 
put representations. For example, in Hopfield's model, the 
sparsity is exactly 5%. More realistically, there will be a 
range of sparsity. 

The input and output cells of a GM system will
naturally have similar sparsities to those of all the cells in
a distributed-memory system. But the GM cells must respond
much more rarely. So with a GM system we expect
there to be two very different populations of cells distinguished
by one population having a dramatically smaller
sparsity than the other. Whether or not the two populations
are in the same area of the brain is not determined by
general arguments. But we will find that, in fact, the putative
GM cells of Quian Quiroga et al. (2005) do have an
accompanying distributed-code population. For our purposes
it will be unimportant whether the detected distributed- 
code population is the one that provides the actual input 
and output for the detected GM-like cells. 

Expectations for the sparsity of GM cells can be provided 
in terms of the repertoire size of a system, which has a 
connection to behavioral data. 

There are two somewhat different kinds of memory sys- 
tem we will consider. One is typified by face recognition, 
where a recognized input is categorized into one of R cate- 
gories, corresponding to distinct persons; recognition is ex- 
clusive between categories. Then for a random sample of 
faces in the repertoire of the system the sparsity of the GM 
cells is 1 /R, if we assume that each person is allocated the 
same number of cells. 

The second case is for declarative memory (episodes, 
facts, etc). Recall is by a stimulus containing a few compo- 
nents of the original memory. Since the same components 
could be part of other memories, the pattern of recogni- 
tion firing need not be exclusive between memories. Let us 
regard these patterns as priming the recall of the memo- 
ries. Full conscious recall of one particular memory requires 
some extra cues and modulation. With a GM system, non- 
exclusive priming recall would be at a relatively low level 
of firing above some threshold, with full recall involving ex- 
clusive firing at a much higher level. 

In a memory system, a typical stimulus can evoke multiple
memories. If we let $n_m$ typify the number of memories
evoked, then a given GM neuron is caused to fire by a fraction
$n_m/R$ of stimuli. It is therefore convenient to define an
effective repertoire size $R_{\rm eff} = R/n_m$, so that the typical
sparsity is $1/R_{\rm eff}$.

From standard psychological data, we envisage that $R_{\rm eff}$
is thousands to at least millions for interesting cases.

In contrast, the distributed-code cells fire much more 
frequently; this is known from data, and is necessary in 
order that this population can represent a sufficiently large 
number of stimuli. 



4.2. General form of distribution of neural responses 

Measurements of single-cell responses concern only a 
small fraction of cells and of all possible stimuli. So we will 
treat data as being from a sample over cells and stimuli, 
and deduce properties of the whole system: e.g., the relative 
sizes of the cell populations and their sparsities, and hence 
the number of categories coded for by the GM population. 
In doing this, we will quantify, and hence compensate for, 
the strong biases against detection of GM cells. 

Suppose we present a sample of $p$ stimuli that are randomly
chosen from some broad class (e.g., pictures of famous
people, images concerning movies that the subject
has watched, pictures of buildings). Any particular cell $i$
responds to some fraction of these, called the cell's (lifetime)
sparsity $\alpha_i$. The number $n$ of stimuli that evoke a
response by the cell is taken from a binomial distribution
of mean $\alpha_i p$:

$P(n, \text{cell } i) = \alpha_i^n (1-\alpha_i)^{p-n} \frac{p!}{n!\,(p-n)!}$.   (2)

This simply corresponds to the probability that $n$ of the
stimuli are in the response-causing class and $p-n$ are in
the non-response-causing class, as regards cell number $i$.
These two classes of stimuli form fractions $\alpha_i$ and $1-\alpha_i$ of
the whole set of stimuli.

We now consider a sample of cells, thereby sampling the
distribution of sparsity over cells, $D(\alpha)$. This means that
the fraction of cells with sparsity $\alpha$ to $\alpha + d\alpha$ is $D(\alpha)\,d\alpha$.
Then the probability of getting $n$ responses to the $p$ stimuli
in some randomly chosen cell is obtained by integrating the
single-cell response with the sparsity distribution:

$P(n|p) = \int d\alpha\, D(\alpha)\, \alpha^n (1-\alpha)^{p-n} \frac{p!}{n!\,(p-n)!}$.   (3)



This is a general result. The only necessary assumption is 
that the cells are randomly chosen out of some more global 
set of neurons (e.g., hippocampus) and that the stimuli are 
randomly chosen out of some global class. 
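
When the sparsity distribution consists of a few discrete populations, the integral in Eq. (3) reduces to a weighted sum of binomial distributions. A minimal Python sketch of that special case follows. The function names, the integer stimulus count p = 94, and the population weights and sparsities are illustrative assumptions, not fitted values.

```python
from math import comb

def binomial_pmf(n: int, p: int, sparsity: float) -> float:
    """Single-cell probability of n responses to p stimuli, as in Eq. (2)."""
    return comb(p, n) * sparsity**n * (1 - sparsity) ** (p - n)

def response_distribution(p: int, populations):
    """Eq. (3) for a discrete sparsity distribution.

    populations: list of (weight, sparsity) pairs whose weights sum to 1.
    Returns the list [P(0|p), P(1|p), ..., P(p|p)].
    """
    return [
        sum(weight * binomial_pmf(n, p, sparsity) for weight, sparsity in populations)
        for n in range(p + 1)
    ]

# Illustrative two-population example (values chosen by us, not a fit).
populations = [(0.95, 0.0002), (0.05, 0.04)]
P = response_distribution(94, populations)
print([round(value, 4) for value in P[:6]])
```

Because the weights are non-negative and sum to one, the resulting values form a normalized probability distribution over n.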

The value of $\alpha$ for a cell and the distribution $D(\alpha)$ depend
both on the choice of stimulus class and on the choice
of the threshold for a response. Changing either will naturally
affect the distribution. For the data we analyze, the
response criterion is given in Quian Quiroga et al. (2005).

If multiple sessions and multiple subjects are considered,
Eq. (3) continues to apply, with $D(\alpha)$ being the distribution
averaged over subjects. So this form is amenable to the
analysis of aggregated data.

Observe also that the derivation of the formula does not
require any assumption about the independence of the firing
of different neurons: the formula is simply an average
over all neurons in whatever area is being sampled. This
allows the formula to be completely general, in contrast
to the model of Waydo et al. (2006), which requires that
neuron-neuron correlations be neglected.

A common ansatz, as in Waydo et al. (2006), is to assume
a fixed sparsity $\alpha_1$, i.e., to set $D(\alpha) = \delta(\alpha - \alpha_1)$. Such
a model we term a single-population model. But for an
analysis of a possible GM population, we must allow for at
least two populations.

From a mathematical point of view, Eq. (3) expands $P(n)$
in basis functions, with expansion coefficients $D(\alpha)$. The
significance of its use is fourfold: (1) It relates the distributions
for different pattern numbers $p$ via a common set of
expansion coefficients $D(\alpha)$. (2) The expansion coefficients
are non-negative. (3) For a distributed-code population to
represent all possible stimuli, efficiency is important, i.e.,
using the minimum of neurons. This will tend to maximize
the sparsity subject to other constraints like keeping
relatively low the metabolic costs (Lennie, 2003) of
action-potential generation. Thus we should expect the sparsity of
the distributed-code population to vary over a fairly narrow
range. (4) Any GM population has an extremely small
sparsity, so that it populates just the bins with $n = 0$ and
$n = 1$ responses to the $p$ stimuli.

Fig. 3. (a) Qualitative expectation for distribution of sparsity over
cells with a GM population and a distributed-code population. (b)
Idealization with fixed sparsity for distributed-code population. For
a GM population with a large repertoire, its sparsity is too close
to $\alpha = 0$ for the left hand peak to be shown correctly scaled on
these graphs.

Therefore the two-population property leads to the qualitative
expectation for $D(\alpha)$ that is shown in Fig. 3(a). The
distribution of responses $P(n)$ can be regarded as a smeared
version of the sparsity distribution $D(\alpha)$ with $\alpha = n/p$.
Given this smearing, a useful approximation is to replace
the distributed-code peak by a delta function at some fixed
typical sparsity, Fig. 3(b).



4.3. Useful approximations 

Although numerical work can always be done with the 
linear combination of binomial distributions Eq. (3), we 
find it convenient to use one of two approximations. First, 
they allow simple analytic calculations, with a consequent 
ease of understanding what features of the data are important
in determining particular parameters in a model of the
sparsity distribution $D(\alpha)$. Second, they also exhibit that
for sufficiently small sparsity there is a degeneracy in fitting
$D(\alpha)$: only certain combinations of model parameters
are determined. We verify that whenever we use these ap- 
proximations in our fits, they agree sufficiently accurately 
with the underlying binomial distribution. 

When the sparsity is small, the binomial distribution for
a cell's responses is approximately Poisson:

$P(n, \text{cell } i) \simeq \frac{(p\alpha_i)^n e^{-p\alpha_i}}{n!}$.   (4)

This is derived by the use of Stirling's approximation, and is
valid when $\alpha_i \ll 1$ and $p \gg 1$, which is true in all the cases
we treat. The approximation depends only on the product
$p\alpha_i$ and not on $p$ and $\alpha_i$ separately.

When the sparsity is so small that we can neglect the 
probability of getting two or more responses, we can use 
what we call the GM-cell approximation: 

$P(n, \text{cell } i) \simeq \begin{cases} 1 - p/R_{\rm eff} & \text{if } n = 0, \\ p/R_{\rm eff} & \text{if } n = 1, \\ 0 & \text{otherwise,} \end{cases}$   (5)

where we have replaced the ultra-small sparsity $\alpha_i$ by $1/R_{\rm eff}$
to relate it to our expectations for the sparsity of GM cell
responses.
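
The accuracy of the two approximations, Eqs. (4) and (5), can be examined numerically. The Python sketch below compares each with the exact binomial form for illustrative parameter values. The function names, the integer p = 94, the sparsity 0.04 and the repertoire size $10^5$ are our choices for the illustration only.

```python
from math import comb, exp, factorial

def binomial(n: int, p: int, sparsity: float) -> float:
    """Exact single-cell response probability."""
    return comb(p, n) * sparsity**n * (1 - sparsity) ** (p - n)

def poisson(n: int, p: int, sparsity: float) -> float:
    """Poisson approximation, Eq. (4); depends only on the product p * sparsity."""
    mean = p * sparsity
    return mean**n * exp(-mean) / factorial(n)

def gm_cell(n: int, p: int, r_eff: float) -> float:
    """GM-cell approximation, Eq. (5), with sparsity 1 / r_eff."""
    if n == 0:
        return 1 - p / r_eff
    if n == 1:
        return p / r_eff
    return 0.0

p = 94
# Distributed-code regime: Poisson against binomial.
for n in range(6):
    print(n, round(binomial(n, p, 0.04), 4), round(poisson(n, p, 0.04), 4))

# Ultra-sparse regime: GM-cell approximation against binomial.
r_eff = 1e5
for n in range(3):
    print(n, binomial(n, p, 1 / r_eff), gm_cell(n, p, r_eff))
```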

5. Data analysis 



We now analyze the measurements by Quian Quiroga et al.
(2005). For each of their experimental subjects, there was
first a screening session in which a large number of dis- 
parate images were presented. This was sufficient to detect 
responsive cells, but not to measure their selectivity. Then 
there was a testing session that probed the selectivity by 
using many different images of the same people and objects 
to which responses were found in the screening session. We 
will find useful population information from the screening 
session data alone. 

It would be a big mistake to include the testing session 
data in our analysis, since the images for a testing session 
were systematically chosen to concern people and objects 
whose images evoked a response in the previous screening 
session. The distribution $D(\alpha)$ would be different in the
two sessions. For example, suppose that in the screening 
session three images of Brad Pitt, Jennifer Aniston and 
Halle Berry evoked responses from three classic GM cells. 
These cells have a very small sparsity 1/R with respect to 
images of people, with R being the subject's repertoire for 
recognizing faces. Then in a testing session that uses equal 
numbers of images for just these three people, each of the 
three GM cells would respond to one third of the images. 
Thus with respect to the new, specially chosen stimulus 
class, these cells have sparsity 1/3. 

We use the following data: 

- Recordings were made from 343 single units and 650 
multi-units. Given the substantial number of single units, 
we propose that the multi-units on average correspond to 
2.5 neurons, to give a total of approximately 2000 cells, 
250 in each of 8 patients. 

The number 2000 is for cells that produced some detectable
signal. However there are many more cells that
are within range of detection by the extracellular electrodes
used but that failed to give any identified action
potentials (Quian Quiroga, private communication,
and Waydo et al. (2006); Henze et al. (2000); Buzsáki
(2004)). Thus we should increase the number of cells by
some factor $K$, whose value we will estimate later. Then
the total number of cells available for detection is $2000K$;
any of these would have been detected if it had given
action potentials at rates comparable to that of the
actually detected cells. Our numerical results will have a
very simple scaling with $K$.

- There were on average p = 93.9 stimuli in each screening 
session. 



- A total of 132 units produced a response above threshold. 

- Of these, 51 were candidate GM cells, i.e., they responded 
to a single image within the screening session. 

- The remaining 81 were not so highly selective. 

- On average, the responsive units responded to 3.1% of 
the presented images, i.e., to 2.9 images. 

Given the low fraction of responsive units, we assume that 
an above-threshold response from a multi-unit is a response 
from one particular cell. 
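
The derived quantities in this list follow from simple arithmetic on the quoted data, as the Python sketch below shows. The variable names are ours. The figure of 2.5 neurons per multi-unit is the authors' proposal as stated above, not a measured value.

```python
single_units = 343
multi_units = 650
neurons_per_multi_unit = 2.5      # proposed average, from the text

# Estimated number of detected cells, rounded to about 2000 in the text.
detected_cells = single_units + neurons_per_multi_unit * multi_units

responsive_units = 132
single_image_units = 51           # candidate GM cells
less_selective_units = responsive_units - single_image_units

stimuli_per_session = 93.9
mean_response_fraction = 0.031    # responsive units respond to 3.1% of images
mean_images = mean_response_fraction * stimuli_per_session

print(detected_cells, less_selective_units, round(mean_images, 1))
```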

We will analyze the data with the aid of our general 
expansion, (3). Since we wish to test compatibility with 
the GM cell hypothesis, we arrange our analysis without 
any initial assumption about the necessary existence of GM 
cells: 

(i) First we attempt to make a fit with a conventional 
distributed-code model with a fixed sparsity. 

(ii) When we find this fails to be a good fit, we add a 
second component of different sparsity, as a minimal 
model to fit the data. 

(iii) The second component turns out to have such small 
sparsity that only its responses for n = 1 are signifi- 
cant. 

(iv) This is suggestive that there are indeed cells that 
approximate GM cells. So we reanalyze the data in 
terms of a model of GM-like cells together with a 
distributed-code population, so as to determine ap- 
propriate properties of the GM-like population. This 
makes it easy to allow for issues like the stimuli being 
or not being in the system's repertoire. 

5.1. Single distributed-code population 

We first try a model of a single distributed-code
population with a single sparsity $a$. This is the model
(Rolls and Treves, 1998) normally used in theoretical work
on autoassociative networks. It corresponds to a term
$f_D\,\delta(\alpha - a)$ in the general formula (3). Here $f_D$ is the
fraction of cells in this population, with the remaining
neurons being silent. (The most conventional versions, as in
Waydo et al. (2006), assume $f_D = 1$.) In the Poisson
approximation, the probability that a particular neuron
is in the distributed-code population and that it fires in
response to $n$ out of $p$ presented images is

$P(n \,\&\, D) \simeq f_D \frac{(pa)^n e^{-pa}}{n!}$.   (6)



In view of a possible GM-cell population, which would
appear almost entirely at $n = 1$ and $n = 0$, we fit the two
parameters of the population with data from those cells
that give $n \geq 2$ responses. In App. A we give more details
of the model including its later elaboration to include a
GM population, and obtain formulae for two measurable
quantities. One, $P(n \geq 2)$, is the probability of getting
2 or more responses from a cell; its experimental value is
$81/(2000K)$. The second quantity is the mean value of $n$
for these cells, which we write as $\langle n \rangle_{n \geq 2}$; its experimental
value is obtained from

$\langle n \rangle_{n \geq 1} P(n \geq 1) = 1 \times P(1) + \langle n \rangle_{n \geq 2} P(n \geq 2)$.   (7)

Hence, from the data

$\langle n \rangle_{n \geq 2} = 2.9 \times \frac{132}{81} - \frac{51}{81} = 4.1$.   (8)

Fig. 4. Distribution of cells with a particular number of responses.
The dark bars correspond to the fitted distributed population and
the light bar is the excess of data above the distributed-population
contribution. Note that the published data (Quian Quiroga et al.,
2005) only enable us to obtain the sum of the bins with $n \geq 2$ and
the mean of $n$ restricted to these bins. So the dark bars represent
how cells with a distributed coding would give these responses.



From Eqs. (A.4) and (A.5) we find that the distributed cells
are a fraction

$f_D = 4.6\% \times \frac{1}{K}$   (9)

of all cells, and that $pa = 3.7$. Hence the sparsity is

$a = \frac{3.7}{93.9} = 4\%$,   (10)

independently of the value of the factor K for the num- 
ber of silent cells. Given the expected statistical errors on 
the data, due to finite size samples, the relative error on 
these numbers is about 20%. We now have a complete ex- 
perimentally determined prediction for the distribution of 
responses in the distributed population, the dark bars in 
Fig. 4. 
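
The fit can be reproduced approximately from the quoted data. The appendix formulae (A.4) and (A.5) are not reproduced in this section, so the Python sketch below uses expressions that we derive directly from the Poisson form of Eq. (6): the mean of a Poisson variable conditioned on $n \geq 2$, and the probability of $n \geq 2$. These may differ in form from the appendix, and the calculation is done for K = 1.

```python
from math import exp

# Data quoted in the text (screening sessions).
P_STIMULI = 93.9
N_DETECTED = 2000                  # detected cells, i.e. K = 1
N_RESPONSIVE = 132
N_ONE_RESPONSE = 51
N_MULTI = N_RESPONSIVE - N_ONE_RESPONSE
MEAN_N_RESPONSIVE = 2.9

# Eq. (8): mean number of responses among cells with n >= 2.
mean_n_ge2 = (MEAN_N_RESPONSIVE * N_RESPONSIVE - N_ONE_RESPONSE) / N_MULTI

def poisson_mean_given_ge2(y: float) -> float:
    """Mean of a Poisson(y) variable conditioned on n >= 2 (our derivation)."""
    return y * (1 - exp(-y)) / (1 - (1 + y) * exp(-y))

def solve_y(target: float, lo: float = 0.1, hi: float = 20.0, iters: int = 100) -> float:
    """Bisection; the conditional mean increases with y on this interval."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if poisson_mean_given_ge2(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

y = solve_y(mean_n_ge2)                    # y = p * a
prob_ge2 = 1 - (1 + y) * exp(-y)           # P(n >= 2) within the population
f_D = (N_MULTI / N_DETECTED) / prob_ge2    # fraction of cells, for K = 1
a = y / P_STIMULI                          # sparsity of the population

print(f"<n>_(n>=2) = {mean_n_ge2:.2f}, pa = {y:.2f}, f_D = {100 * f_D:.1f}%, a = {100 * a:.1f}%")
```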

Given this fit with rather sparsely firing cells, we find the
numbers of cells in this population that give zero and one
responses:

              Frac. of all cells    Num. of cells
  n = 0       0.11%                 2 ± 1.5
  n = 1       0.41%                 8 ± 3
  n ≥ 2       4.1%                  81 ± 9            (11)

These are an extrapolation from $n \geq 2$ using a natural
model for sparse distributed coding. The errors are just
from the sampling statistics: given the population parameters,
the standard deviation of the number of cells, $N_n$, in
each bin is just $\sqrt{N_n}$.
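
The entries of this table follow from the Poisson form of Eq. (6) with the fitted parameters. The Python sketch below evaluates them using the rounded values quoted above (4.6% and 3.7, with K = 1), so small differences from the tabulated figures are to be expected from rounding. The variable names are ours.

```python
from math import exp, sqrt

f_D = 0.046          # fraction of cells in the distributed population, K = 1
y = 3.7              # p * a from the fit
n_detected = 2000
measured_at_one = 51

frac = {
    "n = 0": f_D * exp(-y),
    "n = 1": f_D * y * exp(-y),
}
frac["n >= 2"] = f_D - frac["n = 0"] - frac["n = 1"]

for label, fraction in frac.items():
    cells = fraction * n_detected
    # Sampling error taken as the square root of the expected count.
    print(f"{label}: {100 * fraction:.2f}% of cells, {cells:.1f} +/- {sqrt(cells):.1f}")

# Excess of measured single-response cells over the extrapolation.
excess_at_one = measured_at_one - frac["n = 1"] * n_detected
print(f"excess at n = 1: {excess_at_one:.0f}")
```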

The measured number of cells in the $n = 1$ bin is 51, far
in excess of the extrapolation. Even if there is a distribution
of sparsity for different cells in the distributed-code
population, this cannot change this deduction greatly: a cell
that fires in response to at least several percent of stimuli
is likely to be detected, and relatively few distributed-code
cells give just $n = 1$ and $n = 0$ responses to the ~100
presented stimuli. We also note that only around 2% of the
distributed-code cells fail to get detected: these are the cells
in the $n = 0$ bin. When they are in range of the electrodes,
detection of the distributed-code cells is almost unbiased.

Fig. 5. The predicted distribution of firing of cells when $p = 500$
stimuli are used with the same model parameters as in Fig. 4. The
distribution is now bimodal, with a clear gap between the distributed
population response and the GM response, now off-scale, at $n = 1$.

We deduce from the excess at n = 1 that there is strong 
evidence for a second population of ultra-sparsely firing 
cells, just as predicted by general considerations if there 
is a GM system. The null model, with only a distributed 
population of cells together with completely silent cells, 
appears to be ruled out. Fig. 4 and our conclusion about the 
n = 1 excess are independent of the number of undetected 
silent cells. They are also independent of any assumption 
that the cells in the second population are actual GM cells. 

Similar evidence for an excess of sparsely firing cells has
perhaps been found by Barnes et al. (1970). Their Fig. 8
shows an excess for some but not all hippocampal-related
areas in the rat.

Of course, a better extrapolation could be made if ex- 
perimental values of P(n) as a function of n were available. 
We could imagine several populations of input cells, acti- 
vated by different kinds of image, so an improved model is 
a combination of several distributions of the form (6), as in 
Eq. (3). 

It has been said that in order to deduce the two pop- 
ulation property from a plot such as Fig. 4, the distribu- 
tion must necessarily be bimodal. Obviously, if we have a 
bimodal distribution, the inference would be cleanest and 
without theoretical prejudice. We illustrate this in Fig. 5, 
where we apply a two-population model described below to 
predict the responses to p = 500 stimuli. There is a GM- 
cell response that remains at n = 1. With a more limited 
set of stimuli, we need a theoretical expectation for the 
distributed-code population to extrapolate to n = 1 from 
the data, so as to quantify a possible excess at $n = 1$. Note
however that if the distributed-code population had a dis- 
tribution of sparsity rather than one fixed sparsity, the peak 
in Fig. 5 would be spread out. 

5.2. Initial two-population model


To fit the data we need a minimum of one more population
of cells, evidently of much lower sparsity than the first
population. Without making any initial hypothesis about
its GM-like nature we try the following ansatz for the
sparsity distribution:

$D(\alpha) = (1 - f_D)\,\delta(\alpha - a') + f_D\,\delta(\alpha - a)$,   (12)

where we preserve notation from the previous section, and
resolve the ambiguity between exchanging the definitions
of the two populations by requiring $a' < a$. This model
has three parameters to be fit to three available measured
quantities: the fraction of cells with 1 response, the fraction
of cells with 2 or more responses, and the mean number
of responses. Within the Poisson approximation, which is
always good, and with the restriction to the 2000 detected
cells, we need to solve the following equations

$\frac{N_1}{N} = (1 - f_D)\, x e^{-x} + f_D\, y e^{-y}$,   (13)

$\frac{N_1 + N_{\geq 2}}{N} = 1 - (1 - f_D)\, e^{-x} - f_D\, e^{-y}$,   (14)

$\langle n \rangle_{n \geq 1} \frac{N_1 + N_{\geq 2}}{N} = (1 - f_D)\, x + f_D\, y$.   (15)

From the data, we use $N = 2000$, $N_1 = 51$, $N_{\geq 2} = 81$, and
$\langle n \rangle_{n \geq 1} = 2.9$. The fit parameters are $f_D$, and the
combinations $x = pa'$ and $y = pa$. From this we find

$x = 0.0224, \quad y = 3.717, \quad 1 - f_D = 0.95$,   (16)

so that the sparsities are

$a' = 2.3 \times 10^{-4}, \quad a = 0.039$.   (17)

The sparsity of the extra population is so low that it populates
only the $n = 1$ (and $n = 0$) bins to a good approximation,
even though it concerns 95% of the neurons. Thus the
properties of the higher sparsity population are essentially 
unchanged from the single-population fit, which confirms 
our original choice to fit it to the data concerning cells with 
2 or more responses to stimuli. 
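
The three fit equations, Eqs. (13) to (15), can be solved numerically without special libraries. The Python sketch below uses nested bisection: for a trial value of y, Eqs. (14) and (15) fix the fraction of cells in the higher-sparsity population and x, and the outer search adjusts y until Eq. (13) is satisfied. The bracketing intervals were chosen by inspection and are our assumption. Since the inputs are rounded, small differences from the quoted fit values are to be expected.

```python
from math import exp

# Data quoted in the text.
N, N1, N_GE2, MEAN_N = 2000, 51, 81, 2.9
A = N1 / N                 # left side of Eq. (13)
B = (N1 + N_GE2) / N       # left side of Eq. (14)
C = MEAN_N * B             # left side of Eq. (15)

def bisect(func, lo, hi, iters=200):
    """Root of func on [lo, hi], assuming func(lo) < 0 < func(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def solve_f_and_x(y):
    """For a trial y, satisfy Eqs. (14) and (15) for f_D and x."""
    def x_of(f):
        return (C - f * y) / (1 - f)          # Eq. (15) solved for x
    def eq14_residual(f):
        return (1 - f) * exp(-x_of(f)) + f * exp(-y) - (1 - B)
    f = bisect(eq14_residual, 0.0, C / y)
    return f, x_of(f)

def eq13_residual(y):
    f, x = solve_f_and_x(y)
    return (1 - f) * x * exp(-x) + f * y * exp(-y) - A

y = bisect(eq13_residual, 3.0, 5.0)           # bracket chosen by inspection
f_D, x = solve_f_and_x(y)
print(f"x = {x:.4f}, y = {y:.3f}, 1 - f_D = {1 - f_D:.3f}")
```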

It now becomes useful to analyze the lower sparsity pop- 
ulation in terms of a GM-cell approximation (5). This will 
give us a simple way of determining whether the popula- 
tion's properties are appropriate for true GM cells. Very 
importantly, it will also give us a simple way of treating 
certain variations in the population's properties that are 
appropriate in the GM-cell context, and of allowing for the 
silent-cell correction factor K. 

5.3. Detailed two-population model 

We therefore model the responses of the cells by a population
that uses a conventional distributed code supplemented
by a possible GM-cell population, as illustrated in
Fig. 6, these comprising fractions $f_D$ and $f_{GM}$ of the total
number of cells. The remaining cells do not respond to any
stimuli at all in the class used (which we label as "faces",
even though some stimuli used in Quian Quiroga et al.
(2005) were of other kinds); this fraction $1 - f_D - f_{GM}$ is
completely silent for the purposes of the experiment. We
let $k$ be the fraction of the images used which correspond
to a memory in the GM population.

Fig. 6. Classes of neurons and stimuli in a simple model with GM-like
cells. A fraction $f_{GM}$ of the facially-responsive cells are GM-like. A
second fraction, $f_D$, fire in a distributed and less selective fashion.
The remaining cells do not respond to any facial stimuli. A fraction
$k$ of the stimuli are in the repertoire of the GM cells.

Our model corresponds to a case of the general expansion
Eq. (3), in which we use just two delta functions in
$D(\alpha)$, as in Fig. 3, with the GM-cell approximation (5) used
for the low sparsity population. To this we add a possible
population of absolutely silent cells, i.e., a term in $D(\alpha)$ at
exactly $\alpha = 0$.

As explained earlier, we let $R_{\rm eff}$ be the effective size of
the population's repertoire, i.e., the repertoire $R$ divided by
the number of simultaneously evoked memories. For classic
GM cells $R_{\rm eff} = R$, of course. Then a randomly chosen GM
cell has a sparsity $k/R_{\rm eff}$, i.e., the probability for the image
to be in the system's repertoire times the probability to
respond to an image in the repertoire. When $2000K$ cells
are each presented with (an average of) $p = 93.9$ images of
different people or objects, the expected number of GM-like
responses is therefore $2000 \times K \times 93.9 \times f_{GM} \times k / R_{\rm eff}$. This
number we found to be 43. In fact the choice of images was
made after interviews with the subjects (Quian Quiroga,
private communication), to put them in the repertoires of
the subjects, so we now set $k = 1$, to find

$$R_{\rm eff} \simeq 4400\,K f_{GM}. \tag{18}$$

Now $f_{GM}$ is less than $1 - f_D$, which lies between 0.95 and 1, depending on the
value of $K$. It is useful to define $\hat{R} = R_{\rm eff}/(K f_{GM})$, which
is the combination of parameters actually determined by
data. Then $R_{\rm eff} < (1 - f_D) K \hat{R} = 4400K - 200$.
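
The arithmetic behind Eq. (18) and the bound that follows it can be sketched in a few lines. The helper names below are illustrative, and the bound assumes that the distributed-code fraction scales as 4.6%/$K$, as stated later in this section.

```python
def r_hat(n_detected=2000, p=93.9, gm_responses=43, k=1.0):
    """Combination R_eff / (K * f_GM) fixed by the data (Eq. 18 with k = 1)."""
    return n_detected * p * k / gm_responses


def r_eff_upper_bound(K, f_d_detected=0.046, r=4400.0):
    """Upper bound on R_eff when f_GM < 1 - f_D, with f_D = f_d_detected / K."""
    return (1.0 - f_d_detected / K) * K * r


print(f"R_eff / (K f_GM) = {r_hat():.0f}")           # close to the quoted 4400
print(f"bound for K = 1  = {r_eff_upper_bound(1):.0f}")
print(f"bound for K = 30 = {r_eff_upper_bound(30):.0f}")
```

For $K = 1$ the bound comes out close to the 4200 categories quoted in the next paragraph.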

If we ignored the silent-cell issue and set K = 1 we would
find that at most 4200 categories are coded for. Now the
detected cells are in areas like the hippocampus that obviously
perform additional functions besides face recognition.
Also, the repertoire of humans for faces and many
other categories is very much larger, as measured behaviorally
(Dudai, 1997). This would appear to imply that the
detected GM-like cells are not classic GM cells. However:

- The hippocampus is not the ultimate store of long term
memory, so that the repertoire should be perhaps only a
year's worth.

- The fraction of GM cells, $f_{GM}$, for the chosen stimulus
class could be less than $1 - f_D$. The remaining cells would
not respond to any stimuli in the class (e.g., images of
faces); they might be distributed-code cells or GM cells
for other classes of stimuli. Extrapolating results on images
of faces to other stimuli is sensible, so we would
expect the vast majority of cells, up to a few percent
corrections, to be in whatever GM-like population is appropriate,
so that the upper limit on $R_{\rm eff}$ found above is
appropriate to be applied to the whole GM population
and not just those of the facially-responsive kind.

- Very familiar people might have more neurons allocated 
to them, as required by the storage requirements, and 
some of these could be GM cells. So the assumption of 
equal GM-cell populations for each category could easily 
be false, so that the estimate of categories (e.g., 4200K) 
is only about those categories which are highly familiar 
to the subjects, which brings improved plausibility. 

- There could be an experimental bias in cell selection. 
We made a basic assumption that the cells probed were 
a random sample in these areas. But the properties of 
epilepsy and the selection of suitable subjects for study 
might be such that the electrodes are in areas preferen- 
tially responsive for visual images of people. In that case, 
the category count refers to a restricted set of stimuli, 
improving the plausibility. 

- An interesting possibility is that the cells are not classic
GM cells but code for memories in a GM style, as in the
model of Baum, Moody and Wilczek (1988). This is of
course a natural hypothesis for the hippocampus, if we
grant the GM cell idea at all.

Then $1/R_{\rm eff}$, as estimated above, is the average fraction
of recent memories that concern the people pictured
in the images. $R_{\rm eff}$ will be biased to relatively low values
since the images were chosen to be of people well-known
to the subjects.

If the cells were classic GM cells, we can estimate the
number of cells per exclusive category. There are roughly
$10^7$ cells in one region of the human hippocampus. Dividing
by $4200K$ gives the number of cells per category, about
$2500/K$. With K = 1 this appears rather large. But with
the large value of K that we will calculate later, the number
is quite modest.

Finally, from the fraction $f_D = 4.6\%/K$ of non-GM cells,
we deduce that they number about $5 \times 10^5/K$. An undoubtedly
excessively simple idea is to identify them with the
input representation. This number could obviously be much
higher than the few hundred that are perhaps needed at
a minimum for coding features relevant for face identification.
But there must obviously be many other specialized
and less specialized kinds of distributed input representation,
as well as distributed output representations.
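
The two population counts above follow from simple division. The sketch below uses only the round figures quoted in the text ($10^7$ cells, 4200 categories and a 4.6% distributed-code fraction at $K = 1$); the function names are illustrative.

```python
HIPPOCAMPAL_CELLS = 1e7   # rough figure quoted in the text


def cells_per_category(K, categories_at_k1=4200):
    """Classic GM cells available per category if R_eff is about 4200 * K."""
    return HIPPOCAMPAL_CELLS / (categories_at_k1 * K)


def distributed_cells(K, f_d_at_k1=0.046):
    """Number of distributed-code cells when their fraction is 4.6% / K."""
    return HIPPOCAMPAL_CELLS * f_d_at_k1 / K


for K in (1, 30):
    print(f"K = {K:2d}: {cells_per_category(K):7.0f} cells per category, "
          f"{distributed_cells(K):9.0f} distributed-code cells")
```

With $K = 1$ this gives a few thousand cells per category and roughly $5 \times 10^5$ distributed-code cells, in line with the rounded values in the text.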

Beyond approximating the general sparsity distribution
$D(\alpha)$ by a sum of a small number of delta functions, our calculations
used the Poisson approximation for the distributed-code
population and the GM approximation for the ultra-low
sparsity population. We have verified that using the
full binomial distribution does not significantly affect the
results.
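
A quick way to see how close the two distributions are for the distributed-code population is to compare them term by term. This is an illustrative check rather than the authors' own verification; it assumes $p$ rounded to 94 stimuli (the binomial needs an integer) and the fitted sparsity $a = 0.039$.

```python
import math


def binomial_pmf(n, trials, a):
    """Probability of n responses in `trials` stimuli with sparsity a."""
    return math.comb(trials, n) * a**n * (1 - a) ** (trials - n)


def poisson_pmf(n, trials, a):
    """Poisson approximation with mean trials * a."""
    lam = trials * a
    return math.exp(-lam) * lam**n / math.factorial(n)


TRIALS, A = 94, 0.039   # assumption: p = 93.9 rounded to an integer
max_gap = max(
    abs(binomial_pmf(n, TRIALS, A) - poisson_pmf(n, TRIALS, A))
    for n in range(TRIALS + 1)
)
print(f"largest pointwise difference: {max_gap:.4f}")
```

The largest pointwise difference is well below one percent, small compared with the sampling errors on the measured counts.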



5.4. Silent-cell correction 

The experimental estimates of the numbers of single and
multiple units in Quian Quiroga et al. (2005) depended on
detection of spikes from the neurons, even if the spike numbers
never passed the limit for the defined threshold for
a responsive neuron. However, within detection range of
the extracellular electrodes used are many cells that never
give a detected signal. Waydo et al. (2006) state that in the
data 1-5 units are identified per electrode and they cite
Henze et al. (2000) for an estimate that 120-140 neurons
are within detection range. To get a first estimate we
can say that on average 2 units are detected per electrode
out of 130 possible neurons, so that the neuronal population
is about 65 times the number of units.

We already estimated that the number of neurons
detected in Quian Quiroga et al. (2005) was about twice
the number of units, to give a total of about 2000 detected
neurons. So we should further multiply the number of neurons
by $K = 65/2 \simeq 30$, for a total of 60 000 neurons in
range of detectability.

This increase does not affect our estimate of the sparsity
of the response of the distributed-code neurons; that
stays at 4%. It also does not affect our estimate of the
relative numbers of detected cells in our two populations
(distributed-code v. GM-like). But it does drastically decrease
the fraction of the distributed-code population to
$4.6\%/K \simeq 0.15\%$.

Most importantly it increases our estimate of the number
of categories coded. The basic quantity here is $R_{\rm eff} \simeq 4400K \simeq 10^5$.
Given the roughness of our calculations, this
is probably accurate to a factor of 2 or so. The key issues
concern the orders of magnitude.
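
The chain of estimates in this subsection is short enough to write out explicitly. The sketch below simply repeats the arithmetic of the text; the variable names are illustrative, and the rounding of $K$ to 30 follows the text.

```python
neurons_in_range = 130     # midpoint of the 120-140 estimate cited from Henze et al. (2000)
units_per_electrode = 2    # rough average adopted in the text (1-5 units reported)
neurons_per_unit = 2       # earlier estimate: detected neurons are about twice the units

neurons_per_detected_unit = neurons_in_range / units_per_electrode   # about 65
K_exact = neurons_per_detected_unit / neurons_per_unit               # 65 / 2
K = 30                                                                # rounded as in the text

total_neurons = 2000 * K          # neurons within range of detectability
f_d_corrected = 0.046 / K         # distributed-code fraction after the correction
r_eff = 4400 * K                  # number of categories, of order 10^5

print(K_exact, total_neurons, f"{100 * f_d_corrected:.2f}%", r_eff)
```

The corrected distributed-code fraction is about 0.15% and the category count is about $1.3 \times 10^5$, consistent with the order-of-magnitude statements above.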



6. Comparison with previous measurements of 
distributed representations 

6.1. Waydo et al.



Waydo et al. (2006) work with data that is evidently
a superset of the data that the same group published in
Quian Quiroga et al. (2005) and to which we made a two-population
fit. They make a fit with a one-population model
that is the same as ours in the special case that all detected
cells are in the distributed-code population: $f_{GM} = 0$, $f_D = 1$.
One superficial difference is that they use the exact binomial
distribution instead of the analytically more tractable
Poisson approximation; this makes a negligible difference
for the small sparsities in question. Another difference is
that they normalize to units rather than neurons; but this
only results in trivial scalings of certain parameters.
| Quantity ($N = 42$ units) | Data | Waydo et al. (2006), best fit $a_1 = 0.23\%$ | Waydo et al. (2006), $a_1 = 0.54\%$ | Our fit, 84 neurons | Our fit, 105 neurons | Our fit, 126 neurons |
|---|---|---|---|---|---|---|
| $N_r$: responsive units | 7.9 | 7.7 | 15.9 | 5.5 | 6.9 | 8.3 |
| $S_r$: evocative stimuli | 16.4 | 8.1 | 17.9 | 14 | 17 | 20 |
| $S_{N_r \ge 2}/S$: fraction of stimuli with $N_r \ge 2$ | 4.1% | 0.44% | 2.2% | 1.4% | 2.1% | 2.9% |

Table 1. Comparison of the per session data in Waydo et al. (2006) with the results of their one-population model, and with the predictions of our two-population
model, using its already-determined population parameters. In the two-population model, the number of neurons corresponding
to 42 units is adjustable, with the best result in the middle column. The data as well as the fit in the column headed $a_1 = 0.54\%$ are those
in Waydo et al. (2006). For comments on the predictions in the last line, see the text.



They performed a Bayesian fit to the session-by-session
data on the joint probability of measuring $N_r$ responsive
units and $S_r$ evocative stimuli in a set of $N$ units with $S$
presented stimuli. The probabilities in the model can be derived
from the underlying probability distribution (3), with
one further critical assumption, that firing in the different
neurons is independent.

As can be seen from the on-line supplementary material
for Waydo et al. (2006), not only is it quite difficult
to derive the distribution from the underlying distribution
for the response of one neuron, but the resulting formula
involves a delicate cancellation of opposite-sign terms.
The formula therefore needs extreme care in numerical
work. However certain averaged quantities considered in
Waydo et al. (2006) are easier to derive, and in App. B we
present derivations that apply also to our two-population
model, given only the extra assumption of independence of
different neurons.

The main result of the analysis in Waydo et al. (2006)
is a distribution for sparsity, which is to be interpreted as
a posterior distribution in the Bayesian sense for the single
sparsity $a_1$ of the neural population. We use the symbol
$a_1$ for the sparsity to avoid confusion with the sparsity
parameter of the distributed-code population in our
fits. When the same criterion for a neural response as in
Quian Quiroga et al. (2005) is used, the peak of the distribution
is at $a_1 = 0.23\%$, which is therefore the best fit
according to the usual maximum likelihood criterion. The
distribution has a long asymmetric tail to large $a_1$ and the
average value of the posterior distribution, at $a_1 = 0.54\%$,
is also a useful estimate of the value of $a_1$.

These values are much lower than the value of $a$ in our
two-population model, as is natural if the fit is to be a compromise
between matching an ultra-low sparsity GM-like
population and a higher sparsity distributed-code population.
We see this quantitatively in Table 1, where we show
data and the results of their model fit and the predictions
of our model. The data are from Waydo et al. (2006) and
are an average over 34 sessions. One datum is the average
number of units $N_r$ in one session that respond to at least
one stimulus, out of an average of $S = 88$ stimuli. Another
datum is the average number of stimuli $S_r$ to which at least
one unit responds in a session, out of an average of $N = 42$
detected units.



The one-population model evidently has a choice, to have
a relatively low sparsity to get the correct number of responsive
units, or to have a relatively high sparsity to get the
correct number of evocative stimuli. Our two-population
model does considerably better. We can improve its results
by increasing the number of neurons a bit more relative to
the number of units than we originally supposed. Note that
from the first line of the table, the fraction of responsive
units is measured to be 7.9/42 = 19%. This is substantially
higher than the measurement for the same fraction,
132/993 = 13%, given in Quian Quiroga et al. (2005). So
the data are not completely consistent.

A final piece of session-averaged data given in Waydo et al.
(2006) is the fraction of stimuli that produced a (simultaneous)
response in at least two neurons. In App. B, we
derive a formula for this quantity. As with the number of
evocative stimuli, a neglect of neuron-neuron correlations
is needed. The results are shown in the last line of Table
1. As already observed in Waydo et al. (2006), the one-population
model with their preferred value $a_1 = 0.54\%$
gives a fraction 2.2% that is rather below the data (4.1%).
A comparably bad fit is obtained by our two-population
model.
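
The one-population columns of Table 1 can be reproduced from elementary probabilities once independence between units and between stimuli is assumed. The sketch below uses the exact binomial expressions with $N = 42$ units and $S = 88$ stimuli; it is an illustration of the kind of session-averaged formula involved, not the derivation of App. B itself, and the function name is hypothetical.

```python
def one_population_session(a, n_units=42, n_stimuli=88):
    """Session-level expectations for independent units sharing one sparsity a."""
    p_unit_silent = (1 - a) ** n_stimuli              # a unit answers no stimulus
    p_stim_silent = (1 - a) ** n_units                # a stimulus evokes no unit
    p_stim_single = n_units * a * (1 - a) ** (n_units - 1)
    responsive_units = n_units * (1 - p_unit_silent)
    evocative_stimuli = n_stimuli * (1 - p_stim_silent)
    frac_two_or_more = 1 - p_stim_silent - p_stim_single
    return responsive_units, evocative_stimuli, frac_two_or_more


for a1 in (0.0023, 0.0054):
    n_r, s_r, frac = one_population_session(a1)
    print(f"a1 = {100 * a1:.2f}%: N_r = {n_r:.1f}, S_r = {s_r:.1f}, "
          f"fraction with N_r >= 2 = {100 * frac:.2f}%")
```

The output matches the two Waydo et al. columns of Table 1 to within rounding, which illustrates the trade-off: the lower sparsity reproduces $N_r$ but not $S_r$, and the higher sparsity does the opposite.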

In fact, the bad fit happens quite generally. The formulae
for both $S_r$ and $S_{N_r \ge 2}$ give them in terms of a single
property of the model, the cell-averaged sparsity $\bar{a}$. We
show in App. B that, when $\bar{a}N$ is not too large, the two
quantities obey an approximate relation

$$\frac{S_{N_r \ge 2}}{S} \simeq \frac{1}{2}\left(\frac{S_r}{S}\right)^2. \tag{19}$$



This relation is obeyed to useful accuracy in the model 
calculations in the last two lines in Table 1, but it is violated 
by a factor of two by the data. The derivation is not affected 
by adding yet more populations of different sparsities, but 
only by including neuron-neuron correlations. 
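
The relation between the fraction of stimuli evoking at least two units and the square of the fraction of evocative stimuli can be tested directly against the numbers in Table 1. The sketch below copies those table entries and compares each tabulated fraction with half the square of $S_r/S$, taking $S = 88$; the dictionary keys are illustrative labels.

```python
S = 88  # average number of stimuli per session

# (S_r, fraction of stimuli with N_r >= 2), copied from Table 1
table1 = {
    "data": (16.4, 0.041),
    "one-pop a1=0.23%": (8.1, 0.0044),
    "one-pop a1=0.54%": (17.9, 0.022),
    "two-pop 84 neurons": (14.0, 0.014),
    "two-pop 105 neurons": (17.0, 0.021),
    "two-pop 126 neurons": (20.0, 0.029),
}


def eq19_prediction(s_r, s=S):
    """Half the squared fraction of evocative stimuli."""
    return 0.5 * (s_r / s) ** 2


ratios = {name: frac / eq19_prediction(s_r) for name, (s_r, frac) in table1.items()}
for name, ratio in ratios.items():
    print(f"{name:20s} tabulated / predicted = {ratio:.2f}")
```

All model columns give a ratio close to 1, while the data column gives a ratio above 2, which is the factor-of-two violation described in the text.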

There are in fact two simple ways to overcome this problem.
One is simply that one fraction is proportional to the
number of detected units in a session, and the other is
proportional to its square. Since this number varied quite
widely (Waydo et al., 2006) between sessions (18 to 74),
the session average of $N^2$ cannot be replaced by the square
of the average of $N$. This could easily account for the factor
of two mismatch. In contrast, the fraction of evocative
stimuli is amenable to a simple average over sessions.

| Quantity | Data | One-population model, $a_1 = 0.23\%$ | One-population model, $a_1 = 0.54\%$ | Our two-population model |
|---|---|---|---|---|
| Responsive neurons with $n = 1$ | 51 | 174 | 305 | Exact |
| Responsive neurons with $n \ge 2$ | 81 | 20 | 92 | Exact |
| $\langle n \rangle_{n \ge 2}$ | 4.1 | 2.1 | 2.2 | Exact |

Table 2. Comparison of results of Quian Quiroga et al. (2005) with the predictions of the one-population model of Waydo et al. (2006). Note that
the three parameters of our two-population model are determined from the three data values in this table.

The second possibility is from neuron-neuron correlations,
to which $S_{N_r \ge 2}$ is much more sensitive than the
other observables. If there were a small fraction of nearby
neurons that always fired in pairs, these would disproportionately
contribute to $S_{N_r \ge 2}$, but not nearly as much to $S_r$.

Of course, both these suggestions can be tested by a
closer examination of the data.

We also examine how well the one-population model,
with the parameters from Waydo et al. (2006), agrees with
the data we used from the earlier paper (Quian Quiroga et al.,
2005). This is shown in Table 2. It can be seen that the
observables we used are particularly sensitive to the differences
between the models. A low average sparsity is necessary
to keep the number of responsive neurons down to the
experimental value. But with a one-population model this
also implies that neurons with $n \ge 2$ responses are many
fewer than those with $n = 1$ responses. Moreover the number
of cells with even more responses than 2 is minute, so
that $\langle n \rangle_{n \ge 2}$ is close to its minimum value of 2, whereas the
measured value is much higher. This is a clear indication of the need
for two populations of cells with very different sparsities.
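
The one-population entries of Table 2 follow from a single Poisson distribution with mean $pa_1$. The sketch below assumes the counts are normalized to the 993 recorded units and uses $p = 93.9$; both normalization and function name are assumptions made for illustration, chosen because they reproduce the tabulated values to within rounding.

```python
import math


def one_population_counts(a, n_cells=993, p=93.9):
    """Expected cells with n = 1, cells with n >= 2, and mean n given n >= 2."""
    lam = p * a
    p1 = lam * math.exp(-lam)
    p2plus = 1 - (1 + lam) * math.exp(-lam)
    mean_2plus = lam * (1 - math.exp(-lam)) / p2plus
    return n_cells * p1, n_cells * p2plus, mean_2plus


for a1 in (0.0023, 0.0054):
    n1, n2plus, mean_2plus = one_population_counts(a1)
    print(f"a1 = {100 * a1:.2f}%: n=1 -> {n1:.0f}, n>=2 -> {n2plus:.0f}, "
          f"<n>(n>=2) = {mean_2plus:.2f}")
```

In both cases the mean number of responses among multiply responding cells stays close to 2, far below the measured 4.1, which is the signature of a second population discussed above.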

We conclude that working with the distribution of the 
number of responses by individual neurons (or units), as 
we do, is preferable to the other distributions. The distribution
$P(n|p)$ is easy to work with, and data can be usefully
aggregated over a whole experiment, given only that
the number of stimuli is approximately the same in each 
session and that the stimuli are chosen at random in some 
large class. These conditions can be imposed by an experi- 
menter. The model can always be systematically improved 
by changing the sparsity distribution, e.g., by adding extra 
components. Working with aggregate data keeps the sam- 
pling errors usefully low, and parameters for a model can 
be computed simply from properties of the aggregate data. 



6.2. Abbott, Rolls, and Tovee 



Other analyses of data, e.g., Abbott, Rolls and Tovee
(1996), have reported that hippocampal facially responsive
cells carry a distributed code as opposed to a GM-type
code. See also the recent work of Hung et al. (2005). These
analyses might appear to contradict our calculation that
distributed-code cells are a very small fraction, perhaps
less than 0.2% of the total. However, there is a strong bias
against actually detecting GM-like cells. In this section,
we use our fit to the more recent data to quantify this bias,
at least roughly, to determine whether there is consistency
between our results and the earlier data.

The primary issue is that experiments typically only report
those cells that are actually detected to respond to
at least one of the stimuli used. For example, in a paper
documenting place cells, Thompson and Best (1989) state
"the electrode assemblies were advanced until one or more
hippocampal complex-spike cells were isolated extracellularly."
Then they observe that cells that do not produce
any detectable spikes "are excluded from analysis here due
to our lack of ability to detect them". Since GM-like cells
respond to a very small fraction of stimuli, the ones that
respond to no presented stimulus, i.e., the vast majority,
are typically omitted from an analysis.

The resulting bias can be seen in the data that we
analyzed earlier. A minority of the detected cells in
Quian Quiroga et al. (2005) are in the GM-like class (43
out of 132), even though we have shown that the GM-like
cells can be in the vast majority (99% or more).

With fewer stimuli, the bias becomes even stronger, as
in the data used by Abbott, Rolls and Tovee (1996). They
used 20 face stimuli, and the total number of facially responsive
neurons was 14. A rather higher sparsity was reported
than our result. But this is partly because a different definition
of sparsity was used, applied to the spike numbers
rather than to a binary response criterion. Furthermore the
cells have considerably larger background firing rates than
those in the new data.

Despite the differences in cells and species, we blindly
apply our model to give a rough test of consistency. In
our two-population model, the fraction of GM cells in the
detected cells is

$$\frac{p/\hat{R}}{p/\hat{R} + f_D\,(1 - e^{-pa})} \simeq \frac{1}{1 + \dfrac{200}{p}\,(1 - e^{-0.04p})}. \tag{20}$$



This is the probability that a cell is a GM-like cell conditional
on the cell producing a detected response to one
or more of $p$ stimuli. Notice that the silent-cell ratio $K$
cancels in this formula; we have a relation between numbers
of different kinds of detected cell under different experimental
conditions. We have estimates for the parameters
of the model, so substituting $p = 20$ predicts a detected
GM-cell fraction of 0.15, i.e., about 2 cells, in the set
of cells investigated in Abbott, Rolls and Tovee (1996). In
fact, Abbott, Rolls and Tovee (1996) did report that two of
their cells had a GM-like response. There is, of course, no
significance to the fact that this number is exactly the value
predicted: there are expected statistical fluctuations, and
the measurements were done with different methods and in
a different species than in Quian Quiroga et al. (2005).
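
The detected GM-cell fraction is easy to evaluate for any number of stimuli. The sketch below assumes the values determined earlier with $K = 1$ normalization (a GM-population combination of 4400, a distributed-code fraction of 4.6% and a distributed-code sparsity of 0.04); the function name and argument names are illustrative.

```python
import math


def gm_fraction_of_detected(p, r_hat=4400.0, f_d=0.046, a=0.04):
    """Probability that a cell responding to at least one of p stimuli is GM-like."""
    gm = p / r_hat                                   # GM cells responding once
    distributed = f_d * (1 - math.exp(-p * a))       # distributed cells responding
    return gm / (gm + distributed)


frac_20 = gm_fraction_of_detected(20)
frac_94 = gm_fraction_of_detected(93.9)
print(f"p = 20:   fraction {frac_20:.2f}, about {14 * frac_20:.1f} of 14 cells")
print(f"p = 93.9: fraction {frac_94:.2f} (compare 43/132 = {43 / 132:.2f})")
```

With 20 stimuli roughly 15% of detected cells are expected to be GM-like, while with about 94 stimuli the fraction rises to roughly one third, matching the 43 out of 132 found in the more recent data.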

Nevertheless it is very important that the previous re- 
port, viewed as evidence in favor of distributed representa- 
tions, is completely compatible with GM cells being in the 
vast majority, with the parameters we have determined. 
The undetected GM cells simply appear to be silent within 
the experiment and are therefore classed as not facially re- 
sponsive. Statements about the neural representation be- 
ing distributed apply only to (most of) those cells that the 
measurements actually detected, not to all cells in the rel- 
evant region of the brain. 



7. Discussion 

The results of Quian Quiroga et al. (2005) clearly
suggest the detection of grandmother cells in the classic
sense. Many other experiments have detected individual
cells with strikingly specific responses (e.g.,
Hahnloser, Kozhevnikov and Fee (2002); Jung and McNaughton
(1993); Thompson and Best (1989)); therefore it is useful
to hypothesize that some of these cells are indeed GM-like
cells, even though the concept of GM-cell may need to be
extended and modified.

A purely experimental direct test of the idea needs too
many stimuli to be practical, cf. Churchland and Sejnowski
(1992, p. 179). So other arguments must be brought in,
of which we have provided two. One uses an estimate of
the actual storage requirements for a memory system. We
showed that GM systems can be optimally efficient in the
use of synapses and neurons. The usual efficiency argument
applies only to the input representation, but now carries
the implication that in a GM-like system there must be two
populations of cells with widely different sparsities.

Our second argument is a method to analyze neural responses.
A particular aim is to measure whether they are
quantitatively consistent with there being separate neurons
coding for each recognized person, or, alternatively, for each
individual declarative memory. Our method enables one to
determine whether or not individual cells necessarily code
for multiple persons or memories. We derived a general formula
Eq. (3) for the neural responses in terms of an underlying
distribution of sparsity. Our expansion is a new result
and is applicable independently of any detailed theory or
model of neural function.

In effect, the formula enables us to extrapolate from limited
data to obtain the fraction of stimuli to which cells respond.
It also allows us to compensate for the strong biases
involved in detecting cells when sparsities differ by very
large factors. Thus we obtain valid estimates of the numbers
of cells of different kinds. We thereby solve some of
the issues raised by Olshausen and Field (2005) concerning
the publication of data only about responsive cells. One
primary remaining bias is that different neurons may have
different electrical characteristics, with a consequent different
maximum distance from the electrodes for detectability
of spikes. But this is presumably a milder effect than that
caused by orders of magnitude differences in sparsities.

7.1. Two populations essential

From the data we find indeed that the two-population
property is obeyed. Not only does the ultra-low-sparsity
population comprise the vast majority of cells in the brain
regions concerned (hippocampus, etc.), but its sparsity can
be in a range compatible with the hypothesis of a GM-like
system: roughly $10^{-3}\%$ with a repertoire of $10^5$.

An important role is played by the many silent cells. It
is obviously unreasonable to assume they have no function.
But on the GM-cell hypothesis they naturally are to be
interpreted as the majority of GM cells that are not relevant
to the particular stimuli used in an experiment. The large
number of these cells is what enables one to overcome the
strong biases against detecting a response of any one GM
cell to a limited set of stimuli.

Now the group responsible for the analyzed data argue
(Quian Quiroga et al., 2005; Waydo et al., 2006) that their
data do not support the GM cell idea. In Waydo et al.
(2006), they say "if we assume that a typical adult recognizes
between 10,000 and 30,000 discrete objects (Biederman,
1987), a = 0.54% implies that each neuron fires in
response to 50 - 150 distinct representations." [a should be
replaced by $a_1$ in the notation of the present paper.]

However their analysis assumed a single value of sparsity.
While this is a suitable approximation for conventional
mechanisms of distributed memory, it is very bad for GM-like
systems. Even though the explicit aim of Waydo et al.
(2006) was to test the GM-cell hypothesis, the use of a single
sparsity in effect imposed an assumption that the hypothesis
is wrong.

We showed that the single-population hypothesis is a bad
fit to the data. Since our expansion (3) is very general, the
fault is in the single-population hypothesis, not in any assumption
about neural properties. The rather low value of
sparsity given by Waydo et al. is merely a compromise between
the widely different sparsities of the two populations.

Our results are consistent with an even higher number
of recognized objects than in the estimates of Biederman
(1987). Indeed, even without allowing for the silent cell correction,
our fits allow a GM-cell population with a sparsity
of 1/4200 = 0.024% corresponding to a number of objects
not far from the lower edge of Biederman's range.

Note that our basic estimate of the number of categories,
$10^5$, assumes that the cells are classic GM cells,
each responding to a single individual person. But the
number of categories could be substantially higher. If the
cells are general memory cells, in the style of the model
of Baum, Moody and Wilczek (1988), they could respond
to images of several people. It could also be that more
familiar stimuli, with richer associations, have more cells.
In that case measurements with familiar stimuli, as is the
case in the data, would be biased towards these memories
with unusually large numbers of cells, with a corresponding
reduction in our estimate of the number of categories
compared with the true number.

7.2. Anatomy 



In GM systems, like the model of Baum, Moody and Wilczek
(1988), the number of GM cells is very much larger than
that of the input cells, as is consistent with the numbers we
have deduced. An immediate implication is that each GM
cell receives input from a modest number of input cells,
but that each input cell sends output to a much larger
number of memory cells. Given also our finding that the
GM cells in the relevant regions are in the vast majority,
there are some striking anatomical implications.

In fact striking disparities in synapse number are well
known in the hippocampus (Amaral, Ishizuka and Claiborne,
1973): For example, each CA3 pyramidal cell gets about
50 input synapses from dentate granule cells, while other
connections have tens of thousands of synapses. Note
that hippocampal neurogenesis results in dentate granule
cells, highly appropriate if they are GM-like. However,
general-purpose memories need a wider variety of (processed)
input than does a face recognition system, and
hippocampal-related regions are sufficiently complex that
the real picture is undoubtedly much more complicated.
Even so, a careful analysis of the disparities in synapse
number should provide critical information on neural function
and the viability of GM-like systems.

7.3. Other arguments against GM systems

Other less quantitative arguments have been advanced
against the reality of GM systems, e.g., Churchland and Sejnowski
(1992); Rolls and Treves (1998).

For example, distributed memory systems are said to be
robust against partial destruction, since there is no single
location for a single memory. But we do know that memories
disappear. If there are multiple GM cells for a memory
in different places, then we can overcome the robustness
argument by simple redundancy. Moreover memories
form a network of knowledge, so that individual items of
semantic memory can be readily reconstructed from other
knowledge. Episodic memory is really an ordered sequence
of individual episodes, not necessarily remembered at all
precisely. Any one episode that disappears can be approximately
filled in from neighboring episodes.

Distributed memory systems are also said to be good at
filling in missing parts of input data, as in reconstructing
a full remembered image from a stimulus containing only
a part of the image. But this property can also be true
for GM-like systems. For example, when the BMW architecture
(Baum, Moody and Wilczek, 1988) is used with a
sparse input representation and suitable dynamics for its
GM cells, it also performs pattern completion; the completion
property is actually associated with properties of
sparse representations used for input data.

It has been said that new memories are harder to construct
in GM systems than in distributed-memory systems.
But now that adult neurogenesis in the hippocampus is well
established, it may well be that there is actually a pool of
new neurons available for at least some uses that could include
being GM-like cells for new memories. The new neuron
rate may however be excessively small. In addition, it
is possible that the GM nodes are on a dendritic tree rather
than being whole neurons. It is known that there can be
substantial changes in dendritic topology, which could easily
include the formation of new nodes. Here the fundamental
mode of operation is of a GM-like system while the neural
code of memory neurons takes on some of the aspects
of distributed memory. In any case, there are potential realistic
mechanisms for the formation of new GM nodes, so
that there is no insuperable obstacle here.

Appendix A. Model of neural firing

Our model for the statistics of neural firing has two cell
populations: one that uses a conventional distributed code
with a single sparsity $a$ and a second GM-cell population,
as illustrated in Fig. 6. These form fractions $f_D$ and $f_{GM}$ of
the total number of cells. A remaining population of cells
does not respond to any stimuli at all in the class used in
the experiment. We let $k$ be the fraction of the images used
which have stored representations in the GM population,
we let $R$ be the repertoire of the GM cells, and we let $n_m$
be the typical number of categories (or GM groups) evoked
by a stimulus in the system's repertoire. As before, we let
$R_{\rm eff} = R/n_m$.

Suppose first we record from some random cell known to
be in the distributed population. We have seen that when
we present $p$ images, the probability of getting $n$ responses
is approximately the Poisson distribution in Eq. (2) with
$\alpha = a$.

If, instead, we pick a GM cell, then for each individual
image it has a probability $k n_m/R$ of responding. Therefore
over a set of $p \ll R_{\rm eff}$ unrelated images it has a probability
$kp/R_{\rm eff}$ of responding exactly once. There is a negligible
probability of $n \ge 2$ for such a cell.

Finally, if the cell is outside the above two populations, 
it is silent in the experiment and always gives n = 0. 

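Summing over the distributions of $n$ for cells of the different
kinds, weighted by their fractional population size,
gives

$$
P(n) =
\begin{cases}
1 - \dfrac{f_{GM}\,k\,p}{R_{\rm eff}} - f_D\left(1 - e^{-p\alpha}\right) & \text{if } n = 0, \\[2mm]
\dfrac{f_{GM}\,k\,p}{R_{\rm eff}} + f_D\, p\alpha\, e^{-p\alpha} & \text{if } n = 1, \\[2mm]
f_D\, \dfrac{(p\alpha)^n}{n!}\, e^{-p\alpha} & \text{if } n \ge 2.
\end{cases}
\tag{A.1}
$$
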

This is of the form of the general distribution Eq. (3) with

$$
D(\alpha') = f_D\,\delta(\alpha' - \alpha) + f_{GM}\,k\,\delta(\alpha' - 1/R_{\rm eff})
 + (1 - f_D - f_{GM}\,k)\,\delta(\alpha'),
\tag{A.2}
$$

and with the Poisson approximation for the
distributed-code population and with the GM approximation
that the GM-like cells fire so rarely that their responses
for $n \ge 2$ can be neglected.

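There are several meaningful parameters for the GM population,
but only one combination affects the distribution
$P(n)$. So we define $\bar R = R_{\rm eff}/(k f_{GM}) = R/(n_m k f_{GM})$, to
find

$$
P(n) =
\begin{cases}
1 - \dfrac{p}{\bar R} - f_D\left(1 - e^{-p\alpha}\right) & \text{if } n = 0, \\[2mm]
\dfrac{p}{\bar R} + f_D\, p\alpha\, e^{-p\alpha} & \text{if } n = 1, \\[2mm]
f_D\, \dfrac{(p\alpha)^n}{n!}\, e^{-p\alpha} & \text{if } n \ge 2.
\end{cases}
\tag{A.3}
$$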



The parameter $\bar R$ has the meaning that if a cell is outside the
distributed-code population then it responds to a stimulus
in the chosen global class with probability $1/[\bar R\,(1 - f_D)] \simeq
1/\bar R$, where the last approximate equality applies in the
realistic case that $f_D$ is small, according to our fit.
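
As an illustration, the two-population distribution of Eq. (A.3) can be evaluated numerically. The following is a minimal sketch, not the authors' code: the function name and the parameter values are hypothetical and chosen only to show that the three branches sum to unity, as they must for a probability distribution.

```python
import math

def p_n_two_population(n, p, alpha, f_D, R_bar):
    """P(n) of Eq. (A.3): probability that a random cell responds to
    exactly n of p images in the two-population model."""
    x = p * alpha  # mean number of responses of a distributed-code cell
    if n == 0:
        return 1.0 - p / R_bar - f_D * (1.0 - math.exp(-x))
    if n == 1:
        return p / R_bar + f_D * x * math.exp(-x)
    return f_D * x**n * math.exp(-x) / math.factorial(n)

# Illustrative (not fitted) parameter values.
probs = [p_n_two_population(n, p=90, alpha=0.02, f_D=0.05, R_bar=2000.0)
         for n in range(60)]
total = sum(probs)  # should be 1 up to rounding and a negligible tail
print(total)
```

The GM population enters only through the single combination $\bar R$, which is why the function takes no separate arguments for $k$, $f_{GM}$, $R$ or $n_m$.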

We wish to extract the properties of the distributed-code
cells without contamination from the GM cells. For that
we need properties of the distribution for $n \ge 2$, for which
we use the probability and the mean number of responses.
The probability of $n \ge 2$ is

$$
P(n \ge 2) = f_D \left[ 1 - (1 + p\alpha)\, e^{-p\alpha} \right].
\tag{A.4}
$$

The mean number of responses, in cells with $n \ge 2$, is

$$
\langle n \rangle_{n \ge 2}
 = \frac{\sum_{n \ge 2} n P(n)}{P(n \ge 2)}
 = \frac{p\alpha \left(1 - e^{-p\alpha}\right)}{1 - (1 + p\alpha)\, e^{-p\alpha}}.
\tag{A.5}
$$



These last two equations suffice to determine $\alpha$ and $f_D$ from
the data in Quian Quiroga et al. (2005); see Eqs. (9) and
(10).
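
One way to carry out this determination numerically is sketched below. It is an assumption about procedure, not the method stated in the paper: the right-hand side of Eq. (A.5) depends only on $x = p\alpha$ and increases monotonically from 2, so $x$ can be found by bisection from the measured mean, after which Eq. (A.4) gives $f_D$. The function names and the round-trip test values are hypothetical.

```python
import math

def mean_n_given_ge2(x):
    """Right-hand side of Eq. (A.5) as a function of x = p * alpha."""
    return x * (1.0 - math.exp(-x)) / (1.0 - (1.0 + x) * math.exp(-x))

def solve_alpha_fD(p, prob_ge2, mean_ge2, lo=1e-4, hi=50.0, iters=200):
    """Invert Eqs. (A.4) and (A.5) for (alpha, f_D) by bisection on x."""
    if mean_ge2 <= 2.0:
        raise ValueError("the conditional mean for n >= 2 must exceed 2")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_n_given_ge2(mid) < mean_ge2:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    alpha = x / p
    f_D = prob_ge2 / (1.0 - (1.0 + x) * math.exp(-x))  # Eq. (A.4) solved for f_D
    return alpha, f_D

# Round trip with made-up values: generate the two observables, then recover them.
p, alpha_true, f_D_true = 100, 0.03, 0.04
x_true = p * alpha_true
prob_ge2 = f_D_true * (1.0 - (1.0 + x_true) * math.exp(-x_true))
alpha_fit, f_D_fit = solve_alpha_fD(p, prob_ge2, mean_n_given_ge2(x_true))
print(alpha_fit, f_D_fit)
```

Because only cells with $n \ge 2$ enter both observables, the GM parameter $\bar R$ does not appear anywhere in this inversion.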



Appendix B. Further applications of model

Several further observables are considered by Waydo et al.
(2006). These observables refer to a session in which $S$
stimuli are presented to $N_c$ cells or $N$ units.

One observable is the number of neurons $N_r$ that respond
to at least one stimulus. Its average is just the number of
cells or units times the probability that one cell or unit
responds, which is $P(n \ge 1 | S)$ in the notation
of Eq. (3). Hence the number of responsive units in the
one-population model of Waydo et al. (2006) is

$$
P(n \ge 1 | S) \times \#\,\text{units} = \left(1 - e^{-S\alpha_1}\right) N.
\tag{B.1}
$$

In our two-population model it is

$$
P(n \ge 1 | S) \times N_c
 = \left[ \frac{K S}{\bar R} + K f_D \left(1 - e^{-S\alpha}\right) \right] N_c.
\tag{B.2}
$$



We are not quite sure how many cells correspond to each 
unit in the new data, so in Table 1 we gave results for 
several choices of the ratio of cells to units, $N_c/N$: 2 (as we
estimated for the earlier data), 2.5, and 3. 
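
The dependence on the assumed ratio of cells to units can be sketched as follows. This is a hedged illustration only: it takes Eq. (B.1) as $(1 - e^{-S\alpha_1})N$ and Eq. (B.2) as $[KS/\bar R + K f_D (1 - e^{-S\alpha})]N_c$, and every numerical value, including $K = 1$, is a placeholder rather than a fitted result.

```python
import math

def responsive_one_population(S, N, alpha_1):
    """Expected number of responsive units, one-population form of Eq. (B.1)."""
    return (1.0 - math.exp(-S * alpha_1)) * N

def responsive_two_population(S, N_c, alpha, f_D, R_bar, K):
    """Expected number of responsive cells, two-population form of Eq. (B.2)."""
    per_cell = K * S / R_bar + K * f_D * (1.0 - math.exp(-S * alpha))
    return per_cell * N_c

# Placeholder session: S stimuli shown to N units, with N_c = ratio * N cells.
S, N = 50, 100
for ratio in (2.0, 2.5, 3.0):
    N_c = ratio * N
    n_resp = responsive_two_population(S, N_c, alpha=0.02, f_D=0.05,
                                       R_bar=2000.0, K=1.0)
    print(ratio, n_resp)
```

Since the two-population prediction is linear in $N_c$, the uncertainty in the cells-to-units ratio translates directly into a proportional uncertainty in the predicted number of responsive cells.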

A second observable is the number $S_r$ of stimuli that
evoked a response in at least one neuron in the session. 
To derive this from the response distributions requires a 
further assumption that correlation between the firing of 
different detected neurons can be neglected. 

Now the probability of one stimulus evoking no response
in any of $N$ independent cells (or units) is $P(0|1)^N$, where
$P(0|1)$ is the probability of no response in one cell/unit on
presentation of 1 stimulus. Hence the average number of
evocative stimuli in a session is

$$
S_r = S \left[ 1 - P(0|1)^N \right].
\tag{B.3}
$$

From Eq. (3), we find that in general $P(0|1) = 1 - \bar\alpha$, where
$\bar\alpha$ is the sparsity averaged over cells. In the one-population
model of Waydo et al. (2006) we therefore get

$$
S_r = S \left[ 1 - (1 - \alpha_1)^N \right],
\tag{B.4}
$$

while in our two-population model it is

$$
S_r = S \left[ 1 - \left( 1 - \frac{K}{\bar R} - K f_D\, \alpha \right)^{N_c} \right].
\tag{B.5}
$$



The appearance of $K$ in this last formula is misleading: the
dependence on $K$ of $\bar R$ and $f_D$ in our fit cancels the explicit
factor of $K$. We have used the number of cells $N_c$ in this
formula rather than the number of units $N$, since our fit is
made with respect to cells.
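
For concreteness, the two predictions for the number of evocative stimuli can be compared numerically. The sketch below assumes that Eq. (B.4) reads $S[1 - (1 - \alpha_1)^N]$ and that Eq. (B.5) uses the cell-averaged sparsity $K/\bar R + K f_D \alpha$ with exponent $N_c$; the function names and all parameter values are hypothetical.

```python
def evocative_one_population(S, N, alpha_1):
    """Expected number of evocative stimuli, one-population form of Eq. (B.4)."""
    return S * (1.0 - (1.0 - alpha_1) ** N)

def evocative_two_population(S, N_c, alpha, f_D, R_bar, K):
    """Expected number of evocative stimuli, two-population form of Eq. (B.5)."""
    mean_sparsity = K / R_bar + K * f_D * alpha  # sparsity averaged over cells
    return S * (1.0 - (1.0 - mean_sparsity) ** N_c)

# Placeholder values, for illustration only.
s_one = evocative_one_population(S=100, N=50, alpha_1=0.01)
s_two = evocative_two_population(S=100, N_c=100, alpha=0.02, f_D=0.05,
                                 R_bar=2000.0, K=1.0)
print(s_one, s_two)
```

Both functions rely on the independence assumption stated above, namely that correlations between the firing of different detected neurons are neglected.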

A final observable we consider is the number of stimuli
in a session that evoked responses in 2 or more cells/units.
This quantity, denoted $S_{r,\,n_r \ge 2}$, can be obtained from the
distribution of the number of neurons responding to a single
stimulus:

$$
P(n_r \text{ for 1 stim.})
 = \frac{N!}{n_r!\,(N - n_r)!}\, \bar\alpha^{n_r} (1 - \bar\alpha)^{N - n_r}
 \approx \frac{\left[ N\bar\alpha/(1 - \bar\alpha) \right]^{n_r}}{n_r!}\, (1 - \bar\alpha)^N.
\tag{B.6}
$$

It follows that on average

$$
S_{r,\,n_r \ge 2} = S \left[ 1 - (1 - \bar\alpha)^N - N\bar\alpha\, (1 - \bar\alpha)^{N-1} \right].
\tag{B.7}
$$
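
The closed form in Eq. (B.7) is the binomial probability of two or more responding neurons, multiplied by $S$. A short check of this, assuming independent neurons with a common average sparsity (written `a` in the code) and using made-up values, is:

```python
import math

def stimuli_with_two_or_more(S, N, a):
    """Closed form of Eq. (B.7): S * P(n_r >= 2) for N independent neurons."""
    return S * (1.0 - (1.0 - a) ** N - N * a * (1.0 - a) ** (N - 1))

def stimuli_with_two_or_more_brute(S, N, a):
    """Same quantity from an explicit sum over the binomial distribution."""
    return S * sum(math.comb(N, k) * a**k * (1.0 - a) ** (N - k)
                   for k in range(2, N + 1))

closed = stimuli_with_two_or_more(S=100, N=50, a=0.01)
brute = stimuli_with_two_or_more_brute(S=100, N=50, a=0.01)
print(closed, brute)
```

The closed form simply subtracts the $n_r = 0$ and $n_r = 1$ terms of the binomial distribution from unity, so the two results should agree to rounding error.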

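From Eqs. (B.5) and (B.7), we get a relation between $S_r$
and $S_{r,\,n_r \ge 2}$ valid when $N\bar\alpha$ is less than about unity and $N$
is substantially larger than unity. We expand the powers of
$1 - \bar\alpha$ for small $\bar\alpha$ to obtain

$$
\frac{S_r}{S} \simeq N\bar\alpha \quad \text{if } N\bar\alpha \lesssim 1,
\tag{B.8}
$$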

and then 
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